1 Introduction

This manual serves as a companion document to DEQ’s Water Quality Assessment (WQA) automated assessment methodology for completion of the biennial 305(b)/303(d) Integrated Report (IR). Virginia’s DEQ is among a handful of state agencies that have organized concerted efforts to systematically automate and expedite the water quality assessment process. These agencies have all taken different approaches to best meet their state-specific analysis and reporting needs. Through the Tools for Automated Data Assessment (TADA) working group, EPA is organizing an effort to share code and create national tools to assist partners with assessment analyses. Technological advances in water monitoring have increased the quantity and types of data requiring consideration during an assessment period, while the staff time allocated to the assessment process has remained static or decreased. Automation appears to be the only way to keep up with Clean Water Act requirements and keep the public informed of the status of waterbody health and of restoration timelines for waterbodies not meeting water quality standards. By embracing automated assessment procedures, these agencies aim to provide constituents with higher quality, standardized, and transparent water quality assessment results while adhering to federally mandated deadlines.

Virginia has taken a hybrid approach to automating its water quality assessments. DEQ relies upon an automated process to manipulate the hundreds of thousands of water quality records collected throughout the state during a given assessment period (a six-year window). These scripts evaluate each record for exceedances of the appropriate Water Quality Standards. Regional assessment staff may use interactive web-based analytical tools to provide more context to the tabular results of the exceedance analyses, assisting with Quality Assurance/Quality Control (QA/QC) prior to submitting data to EPA’s Assessment, Total Maximum Daily Load (TMDL) Tracking and Implementation System (ATTAINS) through DEQ’s internal Comprehensive Environmental Data System (CEDS). This approach maximizes the strengths of each partner in the assessment process: computers excel at systematic, expedited data manipulation and analysis at scale, while humans can digest multiple lines of evidence to identify potential data collection or analysis errors and areas of environmental concern.

1.1 How to Use Document

The following report is an agency effort to facilitate the further adoption of automated assessment methods statewide. The intention of this project is to provide a non-programmer audience with accessible and understandable narratives describing the automated water quality assessment process. This document is not a comprehensive WQA Guidance Manual, nor is it an introduction to using the R programming language. Here, we document the overall automated assessment process, explaining reasons certain decisions were made, how to unpack analyses, and where to seek further assistance through additional resources or interactive data exploration through web-based application interfaces.

1.1.1 Programming Expectations

The Water Quality Assessment process is complicated. Learning a programming language is complicated. Combine these topics, the subject of this book, and we arrive at complicated².

The narrative script descriptions are written for understanding at an elementary programming level; however, the individual scripts utilize advanced programming techniques for parsimony, efficiency, and ease of long-term maintenance. While it is not necessary to understand every element of the code provided in order to run the automated assessment methods, familiarity with certain programming themes, listed below, can be helpful:

The internally-published DEQ R Methods Encyclopedia is a good resource when getting started with R. The following articles are useful for familiarizing yourself:

The WQA Technical Support Team is available for assistance should you encounter any questions.

1.2 Project History

The agency began efforts toward automating components of the IR with a dissolved metals assessment written in SAS. These tabular results were provided to regional assessment staff along with raw data queries encompassing data stored in CEDS for each IR data window.

The Freshwater Probabilistic Monitoring program began automating analyses and a final report for inclusion in the 2016 IR using the open source R programming language. These practices have been further refined each IR cycle, improving report quality and graphics as well as significantly reducing the amount of time required to generate a report after data collection. These methods were identified as a possible solution to the ongoing challenges in the WQA program to complete vast amounts of data analysis on increasingly short timelines with limited staff.

An effort to systematically analyze all water quality data stored in CEDS for assessments began in 2017. As a pilot project, lakes and reservoirs in the Blue Ridge Regional Office (BRRO) assessment region were selected as the first waterbody type to undergo automation and receive an interactive web-based tool to assist regional assessment staff for the 2018 IR. These initial automated assessment efforts were completed using the open source R programming language, and the interactive applications were built using the Shiny package. Rivers and streams followed suit for the 2020 IR, joining lacustrine waterbodies with automated assessment methods and interactive assessment tools. The 2020 IR automated methods were scaled from just the BRRO region to statewide applicability. A pilot project to incorporate citizen monitoring data requiring assessment was undertaken in the BRRO assessment region for the 2020 IR. This pilot proved effective in standardizing and increasing the efficiency of incorporating these disparate data and was officially adopted as a process for future IR windows.

After thorough QA from regional assessment staff across the state, the riverine and lacustrine automated assessment tools were rebooted for the 2022 IR with increased functionality and the ability to assess more parameters, including those not stored in CEDS. However, due to delays organizing citizen monitoring data statewide, automated results for these data were not provided for the 2022 IR. A new database schema for archiving station-specific water quality standards and assessment unit information was implemented. This system benefited the assessment process as well as numerous DEQ water programs that previously did not have access to WQS information at that spatial scale. Appendices and fact sheets for the IR were generated using R and Rmarkdown for the first time during the 2022 IR.

The 2024 IR further refines improvements to the riverine and lacustrine automated assessment methods and interactive tools. By partnering with the Chesapeake Monitoring Cooperative (CMC), DEQ leveraged an existing public-facing citizen monitoring data portal to expand utility outside the Chesapeake Bay watershed and incorporate all of Virginia’s citizen monitoring data. These data are now automatically cleaned and stored by the CMC and can more easily be incorporated in the automated assessment process with web scraping techniques.

To date, no estuarine-specific assessment methods have been completed, but the WQA program is investigating utility and potential adoption among regions with estuarine waters.

1.3 WQA Technical Support Team

Should you encounter any technical issues running the included scripts or have questions about the automated assessment methods described, please reach out to these technical staff for assistance:

1.4 Acknowledgements

Many Water Quality Assessment (WQA) and Water Quality Standards (WQS) staff have contributed to this effort:

Please direct any project questions to Emma Jones ().

2 Data Organization

The assessment process requires hundreds of thousands of rows of data collected and stored by DEQ, other state agencies, and citizen partners. These disparate data sources require various levels of data manipulation prior to any analysis. SAS and R are used for the majority of the assessment data cleaning and manipulation processes.

2.0.0.1 Data Location

Most data used for assessments are stored in DEQ’s internal Comprehensive Environmental Data System (CEDS) and made available through a direct connection to the reporting database known as ODS. The agency is working towards storing all data required for assessments in CEDS, but as of the time of writing the following datasets are stored in locations outside of CEDS:

  • Fish Tissue Data
  • PCB Data
  • VDH Data
  • Citizen Monitoring Data
  • Station Metadata

It is important to note that although the data that constitute the conventionals dataset are derived from CEDS/ODS, official assessment records of these data are only stored locally in Microsoft Excel outputs.

2.0.0.2 Data Availability

Most of the following data are provided at the beginning of the assessment process (approximately March of an assessment year). Any delays in data availability have ripple effects on the ability of regional assessors to complete their work on time. Should data be provided for assessment after the expected availability date, assessors may not be able to include said data in a given assessment window.

Exceptions to the March data availability date usually apply to citizen monitoring, bioassessment, and fish tissue data. Citizen monitoring data have historically been provided to regional assessment staff in April of a given assessment year. Bioassessment data for the most recent two years of an assessment window trickle into the assessment process through the summer of a given assessment year due to the lengthy identification process associated with benthic macroinvertebrate samples. Fish tissue data require protracted laboratory analyses before they are returned to DEQ from contractor labs for assessment purposes.

Due to these data delays, regional assessment staff generally must delay certain assessment steps until all data are available for a given station/Assessment Unit, making automated assessment methods all the more important once data are provided.

2.1 Conventionals Data

The conventionals dataset contains the bulk of the data analyzed during any given assessment window. Historically, this dataset has been queried using SAS and a direct connection to the raw monitoring data in the ODS reporting database (the database that makes CEDS data accessible to reporting tools). Recently, efforts to streamline the assessment process and standardize the data provided across agency programs prompted an effort to convert these SAS scripts into R.

The “conventionals query” combines WQM field and analyte data with data handling steps that standardize data discrepancies, such as multiple lab values returned for identical sample date/times or full parameter speciation with no total value. This R query and data standardization method is considered under development, as code review is a constant part of data improvement strategies.
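As a simplified illustration of one such data handling step, the sketch below collapses multiple lab values returned for identical sample date/times into a single record by taking their median. The station and parameter values are toy data, and the median rule is an assumption for illustration only; the production query may resolve duplicates differently.

```r
library(tidyverse)

# hypothetical raw analyte pull containing a duplicated sample date/time
rawAnalytes <- tibble(
  FDT_STA_ID    = c('2-JKS023.61', '2-JKS023.61', '2-JKS023.61'),
  FDT_DATE_TIME = as.POSIXct(c('2022-06-01 10:30', '2022-06-01 10:30', '2022-07-06 09:15')),
  PARAMETER     = 'NITROGEN_TOTAL',
  VALUE         = c(0.52, 0.54, 0.47))

# collapse duplicate station/date-time/parameter combinations to one value
deduped <- rawAnalytes %>% 
  group_by(FDT_STA_ID, FDT_DATE_TIME, PARAMETER) %>% 
  summarise(VALUE = median(VALUE), .groups = 'drop')
```

After this step, each station/date-time/parameter combination appears exactly once in the dataset.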

The current version of the R based conventionals query is available here. In order to access the data the conventionals function calls, users must have a direct connection to ODS from their environment. Please see the DEQ R Methods Encyclopedia article on ODS for more information.

Once the official conventionals dataset is provided for a given assessment window, it is stored on the R server as a pinned dataset to preserve a standardized data version (the data of record) in a centralized location accessible by all DEQ staff. This reduces the time required to pull the data into an R environment and improves assessment application rendering time.

An example conventionals dataset is available on the R server and may be retrieved using the script below. Please see the DEQ R Methods Encyclopedia article on pinned data sources for more information on how to connect your local system to the R server to query these data. This version of the dataset is used for feature enhancements and application development for the IR2024 cycle. However, this is not the official IR2024 conventionals dataset.

conventionals <- pin_get('ejones/conventionals2024draft', board = 'rsconnect')
conventionals[1:50,] %>% # preview first 50 rows
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

2.2 Water Column Metals

query from Roger, for now

2.3 Sediment Metals

query from Roger, for now

2.4 Fish Tissue Data

provided by Rick & Gabe, for now

2.5 PCB Data

From Mark Richards, not in CEDS

2.6 VDH Data

Beach closure, HAB event

2.7 Citizen Monitoring Data

Citizen Monitoring data have historically been provided to DEQ in various formats and schemas using numerous digital and analog storage methods. This system required multiple iterations of lengthy data standardization and QA/QC processes to ensure the data could be utilized for assessments. Previous data receiving methods have been replaced with a standardized system: citizen groups either upload their data to the Chesapeake Monitoring Cooperative (CMC) Data Explorer, which DEQ scrapes via the CMC API using automated R scripts, or provide all data to DEQ in a standardized template. Citizen data not stored in the CMC database live on DEQ staff computers and require further storage solutions.
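A minimal sketch of the web scraping approach is below, assuming a hypothetical JSON endpoint. The URL, query parameters, and field names are placeholders for illustration, not the actual CMC API.

```r
library(httr)
library(dplyr)

# hypothetical endpoint and parameters -- placeholders, not the real CMC API
response <- GET('https://example.org/api/water-quality',
                query = list(state = 'VA', startDate = '2017-01-01'))
stop_for_status(response) # halt if the request failed

# parse the JSON payload into a tibble and standardize names for assessment use
cmcData <- content(response, as = 'parsed', simplifyVector = TRUE) %>% 
  as_tibble() %>% 
  rename(FDT_STA_ID = stationCode) # assumed field name, for illustration only
```

Automating the pull this way means each assessment cycle can refresh citizen data with a single scripted call rather than manual file exchanges.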

A sample citizen monitoring dataset is available in the exampleData directory for you to download and use when practicing the automated scripts.

citmon <-  readRDS('exampleData/citmon.RDS')
citmon[1:50,] %>%  # preview first 50 rows
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

2.8 Station Metadata

Station metadata are critical to ensuring that stations are assessed appropriately, whether an assessment is conducted by hand or with automation. These metadata include the appropriate Assessment Unit and Water Quality Standard information that apply to the station. After this information is identified for a given station, the station may be assessed according to the assessment rules that apply to those standards for the specified waterbody type.

Metadata are provided for stations through a combination of automated spatial analyses and human review. No metadata are ever linked to stations without regional assessor review. The Metadata Attribution section covers the details of attributing metadata to each station.

2.9 Low Flow (7Q10) Data

The 2024 IR is the first assessment window to use a standardized low flow analysis process. Information presented below is still under review by regional assessment staff.

The workflow first calculates a 7Q10 low flow statistic for all available gages in VA based on the last 50 years of water data. The functions called to calculate the xQy flow statistics are identical to DEQ’s Water Permitting protocols and were written by Connor Brogan. The flow data are provided by the USGS NWIS data repository and are retrieved inside the xQy function by the USGS dataRetrieval package. Important assumptions of the xQy program are identified below.

Once flow statistics are generated for all available gages statewide, this information is compared to the available flow data during a given assessment window. Any gage measurements below the 7Q10 statistic are flagged for the appropriate time period. This information is spatially joined to a major river basin layer to extrapolate the available flow data to areas without gaging stations. This is not an ideal extrapolation of flow data, but it serves as a reasonable initial flag telling assessors when and where to investigate further.

These temporal low flow flags are joined to individual site monitoring data by major river basin during the automated assessment process. If parameters used to assess aquatic life condition are collected during low flow periods, then the data are flagged inside the assessment applications, indicating further review is necessary prior to accepting the automated assessment exceedance calculations for that site.
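Conceptually, that join works like the sketch below. The object names, toy values, and the assumption that the monitoring data already carry a BASIN_CODE column are illustrative only.

```r
library(tidyverse)

# hypothetical daily low flow flags by basin (real flags are pinned to the R server)
lowFlowFlags <- tibble(
  BASIN_CODE  = '2B',
  Date        = as.Date('2022-08-15'),
  `7Q10 Flag` = '7Q10 Flag')

# hypothetical site monitoring data already attributed with a basin code
siteData <- tibble(
  FDT_STA_ID = '2-JKS023.61',
  BASIN_CODE = '2B',
  FDT_DATE   = as.Date(c('2022-08-15', '2022-09-01')),
  DO         = c(6.1, 8.4))

# flag any sample collected in a basin on a day when flows fell below 7Q10
flaggedSiteData <- siteData %>% 
  left_join(lowFlowFlags, by = c('BASIN_CODE', 'FDT_DATE' = 'Date'))
```

Samples with no matching low flow day simply receive an NA flag and pass through unaffected.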

2.9.1 7Q10 Method

The method for identifying low flow information for the assessment period is detailed below. The DFLOW_CoreFunctions_EVJ.R script is an assessment-specific adaptation of Connor Brogan’s xQy protocols that makes minor adjustments to the DFLOW procedure for assessment purposes (for more information on these changes, see the Important 7Q10 Calculation Notes section below).

This analysis needs to be performed on or after April 2 of each assessment window cutoff to ensure the entire final water year is included in the analysis. The results are posted on the R server for inclusion in the automated assessment methods.

library(tidyverse)
library(zoo)
library(dataRetrieval)
library(e1071)
library(sf)
library(leaflet)
library(inlmisc)
library(DT)
library(pins) # required for pin_get() calls later in this workflow

source('DFLOW_CoreFunctions_EVJ.R')

2.9.2 USGS Site Data Gathering

Information for all USGS gages sampled in the last 50 years needs to be collected from USGS NWIS. We can use the whatNWISsites() function to identify which sites have daily discharge data (parameter code 00060) in a designated area (stateCd = ‘VA’).

sites <- whatNWISsites(stateCd="VA",
                       parameterCd="00060",
                       hasDataTypeCd="dv") %>% 
  filter(site_tp_cd %in% c('ST', 'SP')) # only keep ST (stream) and SP (spring) sites

sites_sf <- sites %>% 
   st_as_sf(coords = c("dec_long_va", "dec_lat_va"), 
               remove = F, # don't remove these lat/lon cols from df
               crs = 4326) # add projection, needs to be geographic for now bc entering lat/lng

Now we will pull daily flow data for each site identified and calculate 7Q10. This is saved in the local environment as a list object with each gage a unique list element.

# store it somewhere
flowAnalysis <- list()

for(i in unique(sites$site_no)){
  print(i)
  siteFlow <- xQy_EVJ(gageID = i,#USGS Gage ID
    DS="1972-03-31",#Date to limit the lower end of usgs gage data download in yyyy-mm-dd
    DE="2023-04-01",#Date to limit the upper end of USGS gage data download in yyyy-mm-dd
    WYS="04-01",#The start of the analysis season in mm-dd. Defaults to April 1.
    WYE="03-31",#The end of the analysis season in mm-dd. Defaults to March 31.
    x=7,#If you want to include a different xQY then the defaults, enter x here
    y=10,
    onlyUseAcceptedData = F )
  
  flowAnalysis[[i]] <- siteFlow # store each gage's results as a named list element
}

Using the purrr library, we can extract just the flow metric information for each gage and store it in a tibble for later use.

# extract 7Q10 by gageNo
x7Q10 <- map_df(flowAnalysis, "Flows") # EVJ added in gageNo to xQy_EVJ()

x7Q10[1:50,] %>% # preview first 50 rows
    DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

And we need the actual daily flow data to compare to the low flow metrics, so we will extract that next.

# now to extract flow data already pulled by function
flows <- map_df(flowAnalysis, "outdat") 

flows[1:50,] %>% # preview first 50 rows
    DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

Since we only really care about flow data from our assessment window, let’s extract just the flow data and filter to our IR window of interest. We can then join the low flow metrics by gage number and flag any daily average flow data that falls below the gage’s 7Q10 metric.

# now just grab flow data in assessment window, join in 7Q10, identify any measures below 7Q10
assessmentFlows <- map_df(flowAnalysis, "outdat") %>% 
  filter(between(Date, as.Date("2017-01-01"), as.Date("2022-12-31"))) %>% 
  left_join(x7Q10, by = c('Gage ID' = "gageNo")) %>% 
  mutate(`7Q10 Flag` = case_when(Flow <= n7Q10 ~ '7Q10 Flag',
                                 TRUE ~ as.character(NA)))

assessmentFlows[1:50,] %>% # preview first 50 rows
    DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

Here we limit our assessmentFlows object to just the rows where a 7Q10 flag is encountered. We can review these low flow events by organizing them by gage and date.

# anything below 7Q10?
lowAssessmentFlows <- filter(assessmentFlows, `7Q10 Flag` == '7Q10 Flag')
unique(lowAssessmentFlows$`Gage ID`) # what gages do these occur at?

# organize low flow events by Gage ID
(lowAssessmentFlows %>% 
  arrange(`Gage ID`, Date))[1:50,] %>% # preview first 50 rows
    DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

Next, let’s review the low flow gages visually on a map. First, we need to transform this low flow information into a spatial object.

# see where spatially
lowFlowSites <- lowAssessmentFlows %>% 
  distinct(`Gage ID`) %>% 
  left_join(sites_sf, by = c('Gage ID' = 'site_no')) %>% 
  st_as_sf()

We will bring in major river subbasins to better understand how these low flow events happen across the landscape.

basins <- st_as_sf(pin_get("ejones/DEQ_VAHUSB_subbasins_EVJ", board = "rsconnect")) %>% 
  group_by(BASIN_CODE, BASIN_NAME) %>% 
  summarise()

And here is a map of the major river subbasins with all Virginia USGS gages (USGS sites) and just USGS gages with low flow events in the IR window (Low Flow USGS sites).

CreateWebMap(maps = c("Topo","Imagery","Hydrography"), collapsed = TRUE,
             options= leafletOptions(zoomControl = TRUE,minZoom = 5, maxZoom = 20,
                                     preferCanvas = TRUE)) %>%
  setView(-79.1, 37.7, zoom=7)  %>%
  
  addPolygons(data= basins,  color = 'black', weight = 1,
              fillColor= 'blue', fillOpacity = 0.4,stroke=0.1,
              group="Major River SubBasins", label = ~BASIN_NAME) %>% 
  addCircleMarkers(data = sites_sf, color='gray', fillColor='gray', radius = 4,
                   fillOpacity = 0.8,opacity=0.8,weight = 2,stroke=T, group="USGS sites",
                   label = ~site_no)  %>%
  addCircleMarkers(data = lowFlowSites, color='gray', fillColor='red', radius = 4,
                   fillOpacity = 0.8,opacity=0.8,weight = 2,stroke=T, group="Low Flow USGS sites",
                   label = ~`Gage ID`)  %>% 

  addLayersControl(baseGroups=c("Topo","Imagery","Hydrography"),
                   overlayGroups = c("Low Flow USGS sites","USGS sites","Major River SubBasins"),
                   options=layersControlOptions(collapsed=T),
                   position='topleft')

Now we need to join the watershed information to the low flow analysis so we can easily incorporate this information to all monitoring sites that fall in the watershed.

lowFlowSitesHUC <- st_intersection(lowFlowSites, basins) %>%  
  st_drop_geometry() %>% # for this analysis we don't actually need the spatial information
  dplyr::select(`Gage ID` = `Gage.ID`, # spatial joins change tibble names, changing back to name we want
                agency_cd:dec_long_va, BASIN_CODE, BASIN_NAME)

lowAssessmentFlows <- left_join(lowAssessmentFlows, lowFlowSitesHUC, by = 'Gage ID') %>% 
  dplyr::select(Agency, `Gage ID`, `Station Name` = station_nm, `Site Type` = site_tp_cd,
                Latitude = dec_lat_va, Longitude = dec_long_va, Date:Status, 
                `Water Year` = WY, n7Q10, `7Q10 Flag`,BASIN_CODE, BASIN_NAME)

This information is pinned to the R server so we can use it during the automated assessment process.

pin_get('ejones/AssessmentWindowLowFlows', 'rsconnect')[1:50,] %>% # preview first 50 rows
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

2.9.3 Important 7Q10 Calculation Notes

Information from the chief flow statistic programmer (Connor Brogan) about the assumptions of the 7Q10 function used in this analysis:

  1. Desired exceedance probability evaluation in check.fs() must be between 0 and 1
  2. Only complete analysis years are included for calculation of both the Harmonic Mean and all low flows. Years with minimum flow of 0 are removed, but compensated for via USGS probability adjustment
  3. Must have at least 2 analysis years of complete data (line 133) to calculate xQy flows (not necessary for Harmonic Mean)
  4. Provisional gage flow data is removed
  5. Negative flows (e.g. tidally reversed) are treated as NA following the USGS SW Toolbox
  6. Function only uses gage data where the water year is within the range of the years of DS and DE
  7. The gage data is filtered to only include data after the first date of the first analysis season and before the last date of the last analysis season. For instance, if WYS = “04-01” and WYE = “03-31” and DS and DE were 1972 to 2022, then the gage data would be limited to the dates between and including 1972-04-01 and 2022-03-31
  8. The start and end dates of the data must include at least one instance of WYE
  9. The last few days in the analysis season are removed to ensure statistical independence between years
  10. Analysis season must have sufficient days to calculate a minimum flow such that at least 7-days are required to calculate a 7-day flow
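To make the 7-day requirement in assumption 10 concrete, here is a simplified illustration of the core 7Q10 building block: the minimum 7-day rolling average flow within an analysis season. It uses zoo::rollmean on toy data and is not the xQy_EVJ() implementation, which handles seasons, missing data, and the frequency analysis across years.

```r
library(zoo)

# toy daily mean flows (cfs) for a single analysis season
dailyFlow <- c(12, 10, 9, 8, 7, 7, 6, 5, 5, 6, 8, 11, 14, 13)

# 7-day moving average; at least 7 days are required to produce even one value
rolling7day <- rollmean(dailyFlow, k = 7)

# the annual minimum 7-day average flow feeds the 7Q10 frequency analysis
min(rolling7day)
```

Repeating this minimum across all complete analysis years yields the series from which the 10-year recurrence (7Q10) statistic is estimated.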

Based on the changes Emma Jones made to the original function for Water Quality Assessment purposes, here are important assumptions to know (numbered according to system above):

  1. Must have at least 10 analysis years of complete data to calculate xQy flows
  2. Provisional gage flow data is accepted. This is important so water years can be calculated as soon as possible for assessment purposes. Waiting until all data are approved will result in too little time for assessors to review the data. Storm events usually are corrected in the provisional to accepted stage in the USGS QA process, so since we are interested in low flow events, this is not a major concern when using provisional data.

3 Automated Assessment Methods

This chapter details the specific steps required to transform the raw data outlined in the Data Organization chapter into preliminary assessment results. The general automated assessment process begins with data manipulation and organization steps that prepare the data for analysis, then applies specific assessment functions, encoding the assessment guidance and Water Quality Standards (WQS), to the appropriate data, and finally outputs a “stations table” that uses data from the assessment window to summarize each station for review and bulk upload into CEDS. See the Automated Output section to understand the information provided in the stations table output.

The automated assessment scripts are designed to systematically apply assessment guidance and WQS to all data for which the appropriate metadata are provided. This supervised system merely applies the rules that developers have programmed to mimic assessment protocols. These scripts do not provide assessment decisions. It is up to the human reviewer (regional assessor) to either accept the automated result or use best professional judgment to, for example, redact questionable data or interpret complex natural systems.
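For a sense of what “applying an assessment function” means in code, the sketch below counts exceedances of a single-value criterion. The function name, toy data, and criterion are hypothetical and far simpler than the real parameter-specific functions.

```r
library(tidyverse)

# hypothetical single-criterion exceedance counter (real functions are parameter specific)
exceedanceSummary <- function(df, parameter, criterion){
  df %>% 
    filter(!is.na(.data[[parameter]])) %>%  # drop missing results before counting
    summarise(nSamples       = n(),
              nExceedances   = sum(.data[[parameter]] > criterion),
              exceedanceRate = nExceedances / nSamples)
}

# toy water temperature data assessed against a 31 degree C maximum criterion
stationData <- tibble(Temperature = c(24.1, 28.6, 31.5, 22.0, 32.2))
exceedanceSummary(stationData, 'Temperature', 31)
```

Summaries like this, one per parameter and station, are what the stations table collects for reviewer QA.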

3.1 Metadata Attribution

The automated assessment process hinges on each station having the appropriate metadata to analyze all raw data. The required metadata for each station include what Water Quality Standards apply to the station and which Assessment Unit(s) describe the station. One or more stations can be included in an Assessment Unit. It is ultimately the Assessment Units that are used to determine whether or not designated uses are met, which is where the automation stops and the human analysis component is required.

In practice, metadata are spatially joined to stations by a rigorous data organization, spatial joining, and QA process that is detailed below. This automated process runs on the Assessment Data Analyst’s computer, and results are provided to regional assessment staff for individual review through a shared Shiny application. This application is known as the Regional Metadata Validation Tool and is hosted on the R server. It is up to each regional assessor to manually review each suggested metadata link prior to the assessment start date. After stations have completed the manual review process, they can be analyzed using the automated assessment scripts. Detailed instructions on how to use the Regional Metadata Validation Tool are available in the Regional Metadata Validation Tool How To section.

3.1.1 Distinct Sites

Before station metadata can be linked to stations, a list of the unique stations sampled in a given assessment window is required. Because multiple data sources are combined for an assessment, each unique station from each data source with data in the given IR window is included in this list.

For the purposes of the Automated Assessment User Guide, we will overview the process with a snippet of conventionals and citizen monitoring data types. Please see the official script for more information.

library(tidyverse)
library(sf)
library(pins)
library(config)
library(lubridate)

# Connect to server
conn <- config::get("connectionSettings") # get configuration settings

board_register_rsconnect(key = conn$CONNECT_API_KEY,  #Sys.getenv("CONNECT_API_KEY"),
                         server = conn$CONNECT_SERVER)#Sys.getenv("CONNECT_SERVER"))

After setting up your local environment with the necessary packages and connecting to the R server, bring conventionals data into your environment.

conventionals <- pin_get('ejones/conventionals2024draft', board = 'rsconnect') %>% 
  distinct(FDT_STA_ID, .keep_all = T) %>% 
  dplyr::select(FDT_STA_ID, Latitude, Longitude) %>% 
  mutate(Data_Source = 'DEQ')
conventionals[1:50,] %>% # preview first 50 rows
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

Next, bring citizen data into your environment.

citmon <-  readRDS('exampleData/citmon.RDS') %>% 
  distinct(FDT_STA_ID, .keep_all = T) %>% 
  dplyr::select(FDT_STA_ID, Latitude, Longitude, Data_Source)
citmon[1:50,] %>% # preview first 50 rows
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

Create an object of unique sites from these two datasets.

distinctSites <- bind_rows(conventionals, citmon)
distinctSites[1:50,] %>% # preview first 50 rows
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

This process is repeated for all datasets that have data included in the assessment window. These multiple datasets are combined into a single object named distinctSites that we will then compare to existing WQS and AU information. Where stations in our distinctSites object lack either of these pieces of metadata, we must attribute them.
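With more than two sources, the same pattern generalizes. A sketch, assuming each element of the list has already been cleaned to the common FDT_STA_ID/Latitude/Longitude/Data_Source schema (the station IDs and coordinates below are toy values):

```r
library(tidyverse)

# hypothetical cleaned site lists from several data sources
siteLists <- list(
  deq    = tibble(FDT_STA_ID = '2-JKS023.61', Latitude = 37.8,
                  Longitude = -79.8, Data_Source = 'DEQ'),
  citmon = tibble(FDT_STA_ID = 'CMC-001', Latitude = 37.5,
                  Longitude = -77.4, Data_Source = 'CMC'))

# stack all sources and keep one row per unique station
distinctSites <- bind_rows(siteLists) %>% 
  distinct(FDT_STA_ID, .keep_all = TRUE)
```

bind_rows() accepts the whole list at once, so adding a new data source only requires adding another cleaned element to the list.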

3.1.2 Join Assessment Region and Subbasin Information

For rendering purposes in the metadata review application, it is important to have each station linked to the appropriate Assessment Region and Subbasin; this limits the data called into the application to just the essential information. This information is also important for processing each region through a loop for AU connection and WQS attachment, and subbasin information is important for WQS processing. The next chunk reads in the necessary assessment region and subbasin spatial data and spatially joins all sites to these layers. If sites are missing from the distinctSites_sf object, those points plot outside either the assessment region or subbasin polygons. These missingSites are dealt with individually and forced to join to the nearest assessment region and subbasin before rejoining the distinctSites_sf object.

assessmentLayer <- st_read('GIS/AssessmentRegions_VA84_basins.shp') %>%
  st_transform(st_crs(4326)) # transform to WGS84 for spatial intersection

subbasinLayer <- st_read('GIS/DEQ_VAHUSB_subbasins_EVJ.shp')  %>%
  rename('SUBBASIN' = 'SUBBASIN_1')


distinctSites_sf <- st_as_sf(distinctSites,
                            coords = c("Longitude", "Latitude"),  # make spatial layer using these columns
                            remove = F, # don't remove these lat/lon cols from df
                            crs = 4326) %>% # add coordinate reference system, needs to be geographic for now bc entering lat/lng, 
  st_intersection(assessmentLayer ) %>%
  st_join(dplyr::select(subbasinLayer, BASIN_NAME, BASIN_CODE, SUBBASIN), join = st_intersects) 

# if any joining issues occur, that means that there are stations that fall outside the joined polygon area
# we need to go back in and fix them manually
if(nrow(distinctSites_sf) < nrow(distinctSites)){
  missingSites <- filter(distinctSites, ! FDT_STA_ID %in% distinctSites_sf$FDT_STA_ID) %>%
    st_as_sf(coords = c("Longitude", "Latitude"),  # make spatial layer using these columns
             remove = F, # don't remove these lat/lon cols from df
             crs = 4326) 
  
  closest <- mutate(assessmentLayer[0,], FDT_STA_ID =NA) %>%
    dplyr::select(FDT_STA_ID, everything())
  for(i in seq_len(nrow(missingSites))){
    closest[i,] <- assessmentLayer[which.min(st_distance(assessmentLayer, missingSites[i,])),] %>%
      mutate(FDT_STA_ID = missingSites[i,]$FDT_STA_ID) %>%
      dplyr::select(FDT_STA_ID, everything())
  }
  
  missingSites <- left_join(missingSites, closest %>% st_drop_geometry(),
                            by = 'FDT_STA_ID') %>%
    st_join(dplyr::select(subbasinLayer, BASIN_NAME, BASIN_CODE, SUBBASIN), join = st_intersects) %>%
    dplyr::select(-c(geometry), geometry) %>%
    dplyr::select(names(distinctSites_sf))
  
  
  distinctSites_sf <- rbind(distinctSites_sf, missingSites)
}

3.1.3 Join Stations to WQS

All stations are then spatially joined to the current WQS spatial layers to link each unique StationID to a unique WQS_ID. These unique WQS_IDs are a concatenation of the waterbody type, basin code (e.g. 2A, 2B, 2C, etc.), and a number associated with each segment in the spatial layer.

Waterbody types include:

  • RL = Riverine Line
  • LP = Lacustrine Polygon
  • EL = Estuarine Line
  • EP = Estuarine Polygon
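For illustration, the concatenation described above can be sketched in base R. The specific values here (basin code 2B, segment number 123) are hypothetical, not actual segment IDs from the WQS layers:

```r
# Build an example WQS_ID from its three parts: waterbody type prefix,
# basin code, and segment number (hypothetical values for illustration)
waterbodyType <- 'RL'   # Riverine Line
basinCode     <- '2B'
segmentNumber <- '123'

WQS_ID <- paste(waterbodyType, basinCode, segmentNumber, sep = '_')
WQS_ID # "RL_2B_123"

# The two character prefix can be recovered later to identify waterbody type
substr(WQS_ID, 1, 2) # "RL"
```

This prefix convention is what allows later steps to filter snapped stations by waterbody type (e.g. the 'EL', 'LP', 'EP' check used when assigning placeholder WQS_IDs).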

Since transitioning WQS storage from local files (individual assessor files) to a centralized system (on the R server for multiple program uses), the number of stations that require WQS attribution has decreased significantly each IR cycle. To attribute each station with WQS information, we first perform all spatial joins against the new WQS layers to get the appropriate WQS_ID information, then we send that information to an interactive application for humans to manually verify.

Where WQS_ID information is already available for stations, those stations are first removed from the WQS snapping process so as not to repeat efforts unnecessarily. WQS_IDs are maintained across WQS layer updates, so any new WQS information will be picked up when the stations are joined to the WQS metadata.

You can view the available WQSlookup table stored on the server by using the below script. Please see the DEQ R Methods Encyclopedia for information on how to retrieve pinned data from the server.

WQStableExisting <- pin_get('ejones/WQSlookup', board = 'rsconnect')

distinctSitesToDoWQS <- filter(distinctSites_sf, ! FDT_STA_ID  %in% WQStableExisting$StationID)

Here is the table used to store link information from stations to appropriate WQS.

WQStable <- tibble(StationID = NA, WQS_ID = NA)

3.1.3.1 Spatially Join WQS Polygons

Since it is much faster to look for spatial joins by polygons compared to snapping to lines, we will run spatial joins by estuarine polys and reservoir layers first.

3.1.3.1.0.1 Estuarine Polygons (WQS)

Here we find any sites that fall into an estuary WQS polygon. This method is only applied to subbasins that intersect estuarine areas. This process also removes any estuarine sites from the data frame of unique sites that need WQS information (distinctSitesToDoWQS).

We will bring in (source) a custom function that runs the analysis for us, feed it our latest WQS information, and let the function run across all input stations. You can see the latest version of the sourced script in this repository.

source('preprocessingModules/WQS_estuaryPoly.R')

# Bring in estuary layer
estuarinePolys <- st_read('../GIS/draft_WQS_shapefiles_2022/estuarinepolygons_draft2022.shp', 
                          fid_column_name = "OBJECTID") %>%
  st_transform(4326)
  
WQStable <- estuaryPolygonJoin(estuarinePolys, distinctSitesToDoWQS, WQStable)

rm(estuarinePolys) # clean up workspace

Remove stations that fell inside estuarine polygons from the ‘to do’ list.

distinctSitesToDoWQS <- filter(distinctSitesToDoWQS, ! FDT_STA_ID %in% WQStable$StationID)
3.1.3.1.1 Lake Polygons (WQS)

Next find any sites that fall into a lake WQS polygon. This method is applied to all subbasins at once as it is a simple spatial join. You can see the latest version of the sourced script in this repository.

source('preprocessingModules/WQS_lakePoly.R')

lakesPoly <- st_read('../GIS/draft_WQS_shapefiles_2022/lakes_reservoirs_draft2022.shp',
                     fid_column_name = "OBJECTID") %>%
  st_transform(4326)

WQStable <- lakePolygonJoin(lakesPoly, distinctSitesToDoWQS, WQStable)

rm(lakesPoly) # clean up workspace

Remove stations that fell inside lake polygons from the ‘to do’ list.

distinctSitesToDoWQS <- filter(distinctSitesToDoWQS, ! FDT_STA_ID %in% WQStable$StationID)

3.1.3.2 Spatially Join WQS Lines

Now on to the more computationally heavy WQS line snapping methods. First we will try to attach riverine WQS and where stations remain we will try the estuarine WQS snap.

3.1.3.2.1 Riverine Lines (WQS)

To do this join, we buffer all sites that don’t fall into a polygon layer by a set sequence of distances. The output adds a field called Buffer Distance to the WQStable to indicate the distance required for snapping. If more than one segment is found within a set buffer distance, then that many rows will be attached to the WQStable with the single identifying station name. It is up to the QA tool to help the user determine which of these WQS_IDs are correct and drop the other records.

We then remove any riverine sites from the data frame of unique sites that need WQS information (distinctSitesToDoWQS).

source('snappingFunctions/snapWQS.R')

riverine <- st_read('../GIS/draft_WQS_shapefiles_2022/riverine_draft2022.shp',
                    fid_column_name = "OBJECTID") 

WQStable <- snapAndOrganizeWQS(distinctSitesToDoWQS, 'FDT_STA_ID', riverine, 
                               bufferDistances = seq(20,80,by=20),  # buffering by 20m from 20 - 80 meters
                               WQStable)
rm(riverine) # clean up workspace

Remove stations that attached to riverine segments from the ‘to do’ list.

distinctSitesToDoWQS <- filter(distinctSitesToDoWQS, ! FDT_STA_ID %in% WQStable$StationID)

Stations that didn’t connect to the riverine WQS get one last chance at attribution against the estuarine lines WQS. We take all stations from the WQStable that didn’t snap to any WQS segments (Buffer Distance == 'No connections within 80 m') and add them back into our distinctSitesToDoWQS list to try to snap them to the estuarine lines spatial data.

distinctSitesToDoWQS <- filter(WQStable, `Buffer Distance` =='No connections within 80 m') %>%
  left_join(dplyr::select(distinctSites_sf, FDT_STA_ID, Latitude, Longitude, SUBBASIN) %>%
              rename('StationID'= 'FDT_STA_ID'), by='StationID') %>%
  dplyr::select(-c(geometry)) %>%
  st_as_sf(coords = c("Longitude", "Latitude"),  # make spatial layer using these columns
           remove = T, # remove the lat/lon cols from df after converting to geometry
           crs = 4326)
3.1.3.2.2 Estuarine Lines (WQS)

If a site doesn’t attach to a riverine segment, our last step is to try to attach estuarine line segments before throwing an empty site to the users for the wild west of manual QA. To avoid wasted computation, only sites that could be in an estuarine subbasin are sent to this function. This step removes any estuarine line sites from the data frame of unique sites that need WQS information.

source('snappingFunctions/snapWQS.R')

estuarineLines <- st_read('../GIS/draft_WQS_shapefiles_2022/estuarinelines_draft2022.shp' , fid_column_name = "OBJECTID") #%>%
  #st_transform(102003) # forcing to albers from start bc such a huge layer
  #st_transform(4326)

# Only send sites to function that could be in estuarine environment
WQStable <- snapAndOrganizeWQS(filter(distinctSitesToDoWQS, SUBBASIN %in% c("Potomac River",
                                                                            "Rappahannock River", 
                                                                            "Atlantic Ocean Coastal",
                                                                            "Chesapeake Bay Tributaries",
                                                                            "Chesapeake Bay - Mainstem",
                                                                            "James River - Lower",  
                                                                            "Appomattox River",
                                                                            "Chowan River", 
                                                                            "Atlantic Ocean - South" ,
                                                                            "Dismal Swamp/Albemarle Sound")),
                               'StationID', estuarineLines, 
                               bufferDistances = seq(20,80,by=20),  # buffering by 20m from 20 - 80 meters
                               WQStable)

rm(estuarineLines)

Remove stations that attached to estuarine segments from the ‘to do’ list.

distinctSitesToDoWQS <- filter(distinctSitesToDoWQS, ! StationID %in% WQStable$StationID)

Double check no stations were lost in these processes.

# Make sure all stations from the original distinct station list have some sort of record (blank or populated) in the WQStable.

distinctSitesToDoWQS <- filter(distinctSites_sf, ! FDT_STA_ID  %in% WQStableExisting$StationID)
distinctSitesToDoWQS$FDT_STA_ID[!(distinctSitesToDoWQS$FDT_STA_ID %in% unique(WQStable$StationID))]

3.1.3.3 Assign placeholder WQS_IDs so sites do not fall through the cracks when application filtering occurs

We don’t want to give the assessors statewide every station that didn’t snap to anything, so we need to at least partially assign a WQS_ID so each station lands in the correct subbasin on the initial filter.

If a station snapped to nothing, we assign it an RL WQS_ID built from the subbasin it falls into by default.

WQStableMissing <- filter(WQStable, is.na(WQS_ID)) %>%
  # drop from list if actually fixed by snap to another segment
  filter(! StationID %in% filter(WQStable, str_extract(WQS_ID, "^.{2}") %in% c('EL','LP','EP'))$StationID) %>%
  left_join(dplyr::select(distinctSites_sf, FDT_STA_ID, BASIN_CODE) %>%
              st_drop_geometry(), by = c('StationID' = 'FDT_STA_ID')) %>%
  # some fixes for missing basin codes so they will match proper naming conventions for filtering
  mutate(BASIN_CODE1 = case_when(is.na(BASIN_CODE) ~ str_pad(
    ifelse(grepl('-', str_extract(StationID, "^.{2}")), str_extract(StationID, "^.{1}"), str_extract(StationID, "^.{2}")), 
    width = 2, side = 'left', pad = '0'),
    TRUE ~ as.character(BASIN_CODE)),
    BASIN_CODE2 = str_pad(BASIN_CODE1, width = 2, side = 'left', pad = '0'),
    WQS_ID = paste0('RL_', BASIN_CODE2,'_NA')) %>%
  dplyr::select(-c(BASIN_CODE, BASIN_CODE1, BASIN_CODE2))

WQStable <- filter(WQStable, !is.na(WQS_ID)) %>% 
  filter(! StationID %in% WQStableMissing$StationID) %>%
  bind_rows(WQStableMissing)

3.1.4 Join Stations to AUs

Assessment Units (AU) differ from the WQS attribution steps because these links can change cycle to cycle as new AUs are broken off from existing AUs to more appropriately represent the waterbodies they describe. Thus, a single record of all AUs linked to StationIDs, like the WQSlookup table, is not a good solution for this use case. Instead, the AU to StationID link is stored in a more temporary format (the Stations Table Excel file) such that regional assessment staff can easily update the AU information on the fly as they assess stations.

Generally, the logic holds that the provided AUs are a starting point for an upcoming assessment cycle, not necessarily the final AUs for that cycle. The process of linking unique stations in an assessment window (distinctSites_sf) to AU information begins by joining all sites in the upcoming window to the Stations Table from the most recent assessment cycle. If a station does not join to a previous cycle’s Stations Table, that means the station either has not been sampled before (e.g. a new station for Probabilistic Monitoring or a special study) or has not been sampled in so long that it falls outside the last assessment window and has not been carried over from a previous assessment cycle for any reason. These sites that did not join to the last cycle’s Stations Table are where we focus our efforts in the subsequent spatial joining steps.

3.1.4.1 Identify which stations need AU information

First, bring in the final Stations Table from the most recently completed IR cycle. For this example, we source the IR2022 final Stations Table (presented as a spatial object in a file geodatabase, though we strip off the spatial data immediately since it is unnecessary for this purpose). We also rename the friendly publication field names to the more standardized format that the automated assessment functions were built upon (read: we change field names to match previous versions of the Stations Table schema since the assessment functions were built on that data schema).

final2022 <- st_read('C:/HardDriveBackup/GIS/Assessment/2022IR_final/2022IR_GISData/va_ir22_wqms.gdb',
                     layer = 'va_ir22_wqms') %>% 
  st_drop_geometry() %>% # only need tabular data from here out
  # change names of ID305B columns to format required by automated methods
  rename(ID305B_1 = Assessment_Unit_ID_1, TYPE_1 = Station_Type_1,
         ID305B_2 = Assessment_Unit_ID_2, TYPE_2 = Station_Type_2,
         ID305B_3 = Assessment_Unit_ID_3, TYPE_3 = Station_Type_3,
         ID305B_4 = Assessment_Unit_ID_4, TYPE_4 = Station_Type_4,
         ID305B_5 = Assessment_Unit_ID_5, TYPE_5 = Station_Type_5,
         ID305B_6 = Assessment_Unit_ID_6, TYPE_6 = Station_Type_6,
         ID305B_7 = Assessment_Unit_ID_7, TYPE_7 = Station_Type_7,
         ID305B_8 = Assessment_Unit_ID_8, TYPE_8 = Station_Type_8,
         ID305B_9 = Assessment_Unit_ID_9, TYPE_9 = Station_Type_9,
         ID305B_10 = Assessment_Unit_ID_10, TYPE_10 = Station_Type_10)

Now we will join distinct sites to AU information to gather all available data to start the assessment process. Note: assessors may attribute stations to different VAHU6s than strictly where the site is located spatially to communicate that said stations (usually those lying close to a VAHU6 border) are used to make assessment decisions about the designated VAHU6. For this reason, we prefer the VAHU6 designation from the previous assessment cycle over the VAHU6 retrieved from CEDS. If the station does not have a record in the previous cycle’s Stations Table, the VAHU6 designation stored in CEDS is used.

The last rows of the below chunk ensure that each station is only listed once in the resultant table. In previous assessment cycles, stations could be assessed for multiple waterbody types (e.g. riverine and lacustrine assessment uses). Since the assessment database was moved from MS Access to CEDS WQA, this duplication is no longer allowed and thus each station should only have one record.

distinctSites_AUall <- distinctSites_sf %>% 
  st_drop_geometry() %>%
  left_join(final2022 %>% 
              dplyr::select(-c(Latitude, Longitude)), # drop duplicate lat/lng fields to avoid join issues
            by = c('FDT_STA_ID' = 'Station_ID')) %>%
  dplyr::select(FDT_STA_ID : VAHU6.y) %>% # drop the last cycle's results, not important now
  mutate(VAHU6 = ifelse(is.na(VAHU6.y), as.character(VAHU6.x), as.character(VAHU6.y))) %>% # use last cycle's VAHU6 designation over CEDS designation by default if available
  dplyr::select(-c(VAHU6.x, VAHU6.y)) %>%
  group_by(FDT_STA_ID) %>%
  mutate(n = n()) %>% ungroup()

# Find any duplicates
View(filter(distinctSites_AUall, n >1)) # 0 rows, cool

# In previous cycles, n > 1 indicated stations used for both riverine and lacustrine assessments
# For IR2024 this should all be cleaned up bc of new WQA CEDS rules, but always good to double check

Next we will organize stations by whether or not they have AU data based on the previous join. We will call the stations that need AU information distinctSites_AUtoDo. We only test this using the ID305B_1 column because we only need each station to be attributed to at least one AU.

distinctSites_AUtoDo <- filter(distinctSites_AUall, is.na(ID305B_1)) %>%
  st_as_sf(coords = c("Longitude", "Latitude"),  # make spatial layer using these columns
           remove = F, # don't remove these lat/lon cols from df
           crs = 4326)

# These sites are all good for automated assessment  (once we join WQS info from WQSlookup table)
distinctSites_AU <- filter(distinctSites_AUall, !is.na(ID305B_1)) 

# Quick QA: double check the math works out
nrow(distinctSites_AU) + nrow(distinctSites_AUtoDo) == nrow(distinctSites_AUall)

As with the WQS spatial attribution process, we will first join these sites to polygon layers and then line features to minimize any unnecessary computational load.

3.1.4.2 Spatially Join AU Polygons

Since it is much faster to look for spatial joins by polygons compared to snapping to lines, we will run spatial joins by estuarine polys and reservoir layers first.

3.1.4.2.1 Estuarine Polygons (AU)

We will now find any sites that fall into an estuary AU polygon. This method is only applied to subbasins that intersect estuarine areas, and it removes any estuarine sites from the data frame of unique sites that need AU information (distinctSites_AUtoDo). The chunk below sources a custom function for handling AU polygon information, available for download here.

source('preprocessingModules/AU_Poly.R')

# Bring in estuary layer
estuaryPolysAU <- st_read('../va_aus_estuarine.shp') %>%
   st_transform( 4326 ) %>%
   st_cast("MULTIPOLYGON") # special step for weird WKB error reading in geodatabase, never encountered before, fix from: https://github.com/r-spatial/sf/issues/427
  
estuaryPolysAUjoin <- polygonJoinAU(estuaryPolysAU, distinctSites_AUtoDo, estuaryTorF = T) %>%
  mutate(ID305B_1 = ID305B) %>%
  dplyr::select(names(distinctSites_AU)) %>%
  group_by(FDT_STA_ID) %>%
  mutate(n = n(),
         `Buffer Distance` = 'In polygon') %>%
  ungroup() 

rm(estuaryPolysAU) # clean up workspace

Now we add the newly identified estuary stations to distinctSites_AU.

distinctSites_AU <- bind_rows(distinctSites_AU, estuaryPolysAUjoin %>% st_drop_geometry())

And then remove stations that fell inside estuarine polygons from our ‘to do’ list.

distinctSites_AUtoDo <- filter(distinctSites_AUtoDo, ! FDT_STA_ID %in% estuaryPolysAUjoin$FDT_STA_ID)

nrow(distinctSites_AU) + nrow(distinctSites_AUtoDo) == nrow(distinctSites_AUall)
3.1.4.2.2 Lake Polygons (AU)

Next we identify any sites that fall into a lake AU polygon. This method is applied to all subbasins, unlike the previous estuary step. We then remove any lake sites from the data frame of unique sites that need AU information. The chunk below sources the same custom function for handling AU polygon information as above, available for download here.

source('preprocessingModules/AU_Poly.R')

# Bring in Lakes layer
lakesPolyAU <-  st_read('../va_aus_reservoir.shp') %>%
   st_transform( 4326 ) %>%
   st_cast("MULTIPOLYGON") # special step for weird WKB error reading in geodatabase, never encountered before, fix from: https://github.com/r-spatial/sf/issues/427

lakesPolysAUjoin <- polygonJoinAU(lakesPolyAU, distinctSites_AUtoDo, estuaryTorF = F)%>%
  mutate(ID305B_1 = ID305B,
         `Buffer Distance` = 'In polygon') %>%
  dplyr::select(names(distinctSites_AU)) %>%
  group_by(FDT_STA_ID) %>%
  mutate(n = n()) %>%
  ungroup() 

rm(lakesPolyAU) # clean up workspace

Now add the lake stations to distinctSites_AU.

distinctSites_AU <- bind_rows(distinctSites_AU, lakesPolysAUjoin %>% st_drop_geometry())

And we remove stations that fell inside lake polygons from the ‘to do’ list. This is also a good time to double check that we haven’t lost any sites in the above process.

distinctSites_AUtoDo <- filter(distinctSites_AUtoDo, ! FDT_STA_ID %in% lakesPolysAUjoin$FDT_STA_ID)

# Double check all sites are still there
nrow(distinctSites_AU) + nrow(distinctSites_AUtoDo) == nrow(distinctSites_AUall)


rm(lakesPolysAUjoin);rm(estuaryPolysAUjoin) # clean up workspace

3.1.4.3 Spatially Join AU Lines

Now on to the more computationally heavy AU line snapping methods. We can only try to attach riverine AUs since there is no published estuarine lines AU spatial layer (like there is for WQS information). All sites that do not snap to a riverine line within a set buffer distance will be flagged with a comment saying so, and the regional assessors will sort out the most appropriate AU manually in the Regional Metadata Validation Tool.

3.1.4.3.1 Riverine Lines (AU)

Time to run all the sites that didn’t fall into an AU polygon through our polyline snapping/buffering script. This function is slightly different from the WQS buffering function in that it is flexible enough to link any chosen fields from the two input arguments. The output of the function adds a field called Buffer Distance to distinctSites_AU to indicate the distance required for snapping. This does not get transported to the data of record, but it is useful to keep for now for QA purposes. If more than one segment is found within a set buffer distance, that many rows will be attached to the snapTable with the identifying station name. It is up to the Regional Metadata Validation Tool to help the user determine which of these AUs are correct and subsequently drop all other records.

source('snappingFunctions/snapPointToStreamNetwork.R')

riverineAU <- st_read('../va_aus_riverine.shp') %>%
     st_transform(102003)
  
snapTable <- snapAndOrganize(distinctSites_AUtoDo, 'FDT_STA_ID', riverineAU, 
                             bufferDistances = seq(20,80,by=20),  # buffering by 20m from 20 - 80 meters
                             tibble(StationID = character(), ID305B = character(), `Buffer Distance` = character()),
                             "ID305B")


snapTable <- snapTable %>%
  left_join(distinctSites_AUtoDo, by = c('StationID' = 'FDT_STA_ID')) %>% # get station information
  rename('FDT_STA_ID' = 'StationID') %>%
  mutate(ID305B_1 = ID305B) %>%
  dplyr::select(names(distinctSites_AU), `Buffer Distance`) %>%
  group_by(FDT_STA_ID) %>%
  mutate(n = n()) %>%
  ungroup()
  
  
rm(riverineAU) # clean up workspace

Now we can add these sites to the sites with AU information.

distinctSites_AU <- bind_rows(distinctSites_AU , snapTable )

And then remove stations that attached to riverine segments from the ‘to do’ list.

distinctSites_AUtoDo <- filter(distinctSites_AUtoDo, ! FDT_STA_ID %in% snapTable$FDT_STA_ID)

We don’t have estuarine lines AU information, so the sites that don’t connect to any AUs at the max buffer distance will have to be sorted out by the assessors.

distinctSites_AU <- distinctSites_AU %>%
  group_by(FDT_STA_ID) %>%
  mutate(n=n())

We then make sure all stations from original distinct station list (distinctSites_AUall) have some sort of record (blank or populated) in the distinctSites_AU dataset.

# check everyone dealt with
nrow(distinctSites_AU) + nrow(distinctSites_AUtoDo) == nrow(distinctSites_AUall)

distinctSites_sf$FDT_STA_ID[!(distinctSites_sf$FDT_STA_ID %in% unique(distinctSites_AU$FDT_STA_ID))]

if(nrow(distinctSites_AUtoDo) == 0){rm(distinctSites_AUtoDo)} # clean up workspace if QA check good

One last step to make sure buffer distances save correctly for future mapping needs.

unique(distinctSites_AU$`Buffer Distance`)

distinctSites_AU$`Buffer Distance` <- as.character(distinctSites_AU$`Buffer Distance`)

So, what does the end result look like?

distinctSites_AU[1:50,] %>% # preview first 50 rows
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

3.1.5 Where does this information go?

The above analyses help populate the Regional Metadata Validation Tool. After assessors manually review each station that requires WQS/AU information, the data are consolidated and stored on the R server as a pinned dataset. That dataset feeds subsequent tools including the Riverine Assessment App, Lakes Assessment App, and Bioassessment Dashboard, in addition to querying tools including the CEDS WQM Data Query Tool and CEDS Benthic Data Query Tool. The data are provided via the internal GIS services in the WQM Stations (All stations with full attributes) layer hosted on the GIS REST service and GIS Staff Application. In addition to the published tools that source this data, the dataset is included in countless data analysis projects and products. The entire dataset may be pulled from the R Connect service using the following script. Please see the DEQ R Methods Encyclopedia for information on how to retrieve pinned data from the server.

pin_get('ejones/WQSlookup', board = 'rsconnect') # raw lookup table
pin_get("ejones/WQSlookup-withStandards", board = "rsconnect") # lookup table with WQS information joined 

3.2 Organize Metadata

After all stations for a given assessment window have the appropriate WQS and AU information attributed, there are a number of data organization steps that still need to happen before the data are ready for the automated assessment scripts. These steps include:

  • Reorganize the conventionals dataset to match schema for automated assessment scripts
  • Add Secchi Depth to conventionals dataset
  • Add Citizen Monitoring Data to conventionals
  • Ensure all stations have necessary WQS and AU information
    • Organize all new WQS/AU metadata from the R server and add new data to pinned datasets
    • Create a stationsTablebegin dataset
      • Carry forward any stations required from last cycle
  • Clean up PCB dataset
  • Clean up Fish Tissue dataset

The sections below are purposefully sparse as the methods are in flux given that new data are available in ODS. Stay tuned to see how these methods flesh out for more streamlined data organization in future cycles.

3.2.1 Reorganize Conventionals Dataset

The official (SAS) conventionals data schema tends to vary from IR to IR. It is essential that these data are provided to the R scripts exactly as the scripts expect, so each cycle the provided conventionals dataset must be meticulously QAed to ensure all fields are of the expected name/type. Additionally, in order to automate the assessment of citizen monitoring data, we must augment the provided conventionals dataset to accommodate the level schema for each citizen monitoring parameter.

The nuanced data manipulation steps required to convert the conventionals data and citizen monitoring data from the provided format to the required format for automated analyses are beyond the scope of this book. Instead, the required data schema for automated assessment is provided below.

conventionalsSchema <- readRDS('exampleData/schemaFin.RDS')

conventionalsSchema %>% # preview data
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))
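As a minimal sketch of the schema QA described above (using toy data and an illustrative expected-schema table, not the real conventionals pull or schemaFin.RDS), a quick name and class comparison can flag mismatches before the assessment scripts ever run:

```r
# Toy stand-ins for the real objects (illustrative only)
expectedSchema <- data.frame(Field = c('FDT_STA_ID', 'FDT_DATE_TIME', 'DO_mg_L'),
                             Class = c('character', 'POSIXct', 'numeric'),
                             stringsAsFactors = FALSE)

incoming <- data.frame(FDT_STA_ID = '1ABAC001.23',
                       FDT_DATE_TIME = Sys.time(),
                       DO = 8.2) # note: misnamed field for demonstration

# Fields expected but missing from the incoming data
missingFields <- setdiff(expectedSchema$Field, names(incoming))
missingFields # "DO_mg_L"

# Fields present but not in the expected schema (likely renamed or new)
extraFields <- setdiff(names(incoming), expectedSchema$Field)
extraFields # "DO"

# Class check for the fields that did match
matched <- intersect(expectedSchema$Field, names(incoming))
sapply(incoming[matched], function(x) class(x)[1])
```

Any non-empty missingFields or extraFields result signals a schema change that must be reconciled before proceeding.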

3.2.2 Adding Secchi Depth to conventionals dataset

These data are not included in the official (SAS) conventionals data pull but are required for Trophic State Index (TSI) analyses for some lacustrine stations. The below script will only work if a user’s credentials have access to production ODS wqm data. The example script first finds all unique stations in the conventionals dataset before querying Secchi depth information for those stations for the given assessment window. The date range requires updating each assessment cycle.

# dataset of all the unique stations that we need to organize
conventionals_distinct <- conventionals %>%
  distinct(FDT_STA_ID, .keep_all = T) %>%
  # remove any data to avoid confusion
  dplyr::select(FDT_STA_ID:FDT_COMMENT, Latitude:Data_Source) %>%
  filter(!is.na(FDT_STA_ID))

# Query secchi data for unique conventionals stations in the appropriate data window
library(pool)
library(dbplyr)
### connect to Production Environment
pool <- dbPool(
  drv = odbc::odbc(),
  Driver = "ODBC Driver 11 for SQL Server",# or whatever ODBC driver version you have installed locally
  Server= "DEQ-SQLODS-PROD,50000",
  database = "ODS",
  trusted_connection = "yes"
)

stationSecchiDepth <- pool %>% tbl(in_schema('wqm', "Wqm_Field_Data_View")) %>%
  filter(Fdt_Sta_Id %in% !! conventionals_distinct$FDT_STA_ID & 
           between(as.Date(Fdt_Date_Time), "2014-12-31", "2020-12-31") &
           !is.na(Fdt_Secchi_Depth)) %>% # x >= left & x <= right
  dplyr::select(Fdt_Sta_Id, Fdt_Date_Time, Fdt_Depth, Fdt_Secchi_Depth) %>%
  as_tibble() %>%
  mutate(Date = as.Date(Fdt_Date_Time)) %>%
    dplyr::select(FDT_STA_ID = Fdt_Sta_Id, Date, FDT_DEPTH = Fdt_Depth, Fdt_Secchi_Depth) 

Once the Secchi data are acquired, they can be carefully attached to the original conventionals dataset as detailed below.

conventionalsArchive <- conventionals %>%
  mutate(Date = as.Date(FDT_DATE_TIME))

conventionals <- left_join(conventionalsArchive, stationSecchiDepth, by = c('FDT_STA_ID', 'Date', 'FDT_DEPTH')) %>%
  mutate(SECCHI_DEPTH_M = Fdt_Secchi_Depth) %>%
  dplyr::select(-c(Fdt_Secchi_Depth, Date))

3.2.3 Add Citizen Monitoring Data to conventionals

Once the citizen monitoring data is queried, QAed, and matches the required conventionals schema (above), the data can be appended to the conventionals dataset for automated analysis.
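A minimal sketch of that append step, using toy data and a hypothetical object name (citmonConventionals) standing in for the wrangled citizen monitoring data:

```r
library(dplyr)

# Toy stand-ins (illustrative only): both datasets must share the
# conventionals schema exactly before they can be stacked
conventionals <- tibble(FDT_STA_ID = '1ABAC001.23', DO_mg_L = 8.2)
citmonConventionals <- tibble(FDT_STA_ID = 'CIT-001', DO_mg_L = 7.5)

# bind_rows() fills any non-shared columns with NA, which would silently
# break downstream analyses, so confirm the schemas match first
stopifnot(identical(sort(names(conventionals)), sort(names(citmonConventionals))))

conventionals <- bind_rows(conventionals, citmonConventionals)
nrow(conventionals) # 2
```

The stopifnot() guard makes the schema requirement explicit rather than relying on bind_rows() to quietly pad mismatched columns.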

3.2.4 Ensure all stations have necessary WQS and AU information

3.2.4.1 Organizing all new WQS/AU metadata from the R server

A server administrator must retrieve all the WQS/AU information from the attribution application. This directory is consolidated into a single dataset of all new StationID-WQS/AU records. It is important to verify that all expected StationIDs (all distinct stations from the conventionals and citizen monitoring datasets) have WQS and AU information before proceeding. If stations lack this information they will not be assessed accurately.
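A hedged sketch of that verification step (toy StationID vectors and hypothetical object names, not the production datasets) — any station that appears in the monitoring data but not in the consolidated metadata should surface here before proceeding:

```r
# Toy stand-ins for illustration: all distinct stations expected in the
# cycle vs. stations that received WQS/AU attribution
expectedStations <- c('2-JKS023.61', '4AROA192.94', '1ABAC001.23')
attributedStations <- c('2-JKS023.61', '4AROA192.94')

# Stations still lacking WQS/AU metadata; must be empty before proceeding
unattributed <- setdiff(expectedStations, attributedStations)
unattributed # "1ABAC001.23"

# In this toy example one station is flagged, so the warning fires
if(length(unattributed) > 0){
  warning(paste(length(unattributed), 'station(s) lack WQS/AU information'))
}
```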

3.2.4.1.1 Adding new WQS data to pinned datasets

Once WQS_ID/AU information is available for all new stations in an assessment cycle, the WQS information is appended to the master WQS information pin on the R server. AU information can change cycle to cycle, so the IR year is added to the StationID-AU records before pinning back to the master AU information pin.
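A minimal sketch of the IR-year tagging described above, using toy records; the AU IDs shown and the commented pin name are illustrative assumptions, not the production schema:

```r
library(dplyr)

# Toy StationID-AU records for the current cycle (illustrative values only)
newAUrecords <- tibble(StationID = c('2-JKS023.61', '4AROA192.94'),
                       ID305B = c('VAW-EXAMPLE01', 'VAW-EXAMPLE02'))

# Tag records with the IR year so multiple cycles can coexist in one pin
newAUrecords <- mutate(newAUrecords, IR_Year = 2024)

# Appending to the master AU pin would then look roughly like
# (pin name is a placeholder):
# AUlookup <- pin_get('ejones/AUlookup', board = 'rsconnect') %>%
#   bind_rows(newAUrecords)
# pin(AUlookup, name = 'AUlookup', board = 'rsconnect')
unique(newAUrecords$IR_Year) # 2024
```

Tagging by IR year means downstream users can filter the AU pin to a single cycle without any risk of stale links masquerading as current ones.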

To expedite the transfer of actual WQS information (not just WQS_IDs), another pin contains the actual WQS metadata as well as the relevant StationIDs. This pin is called WQSlookup-withStandards. All newly attributed stations need to be joined to the relevant WQS layer to extract this metadata before it can be appended to the master WQS information pin with standards (WQSlookup-withStandards). The script below details this process by bringing in each necessary WQS layer, joining it to the WQSlookup information, and then pinning the result to the R server.

# table with StationID and WQS_ID information that needs actual WQS metadata
WQSlookupToDo <- tibble(StationID = NA, WQS_ID = NA) # blank for example purposes

#bring in Riverine layers, valid for the assessment window
riverine <- st_read('../GIS/draft_WQS_shapefiles_2022/riverine_draft2022.shp',
                    fid_column_name = "OBJECTID")  %>%
  st_drop_geometry() # only care about data not geometry

WQSlookupFull <- left_join(WQSlookupToDo, riverine, by = 'WQS_ID') %>%
  filter(!is.na(CLASS)) # drop sites that didn't actually join 
rm(riverine) # clean up workspace

lacustrine <- st_read('../GIS/draft_WQS_shapefiles_2022/lakes_reservoirs_draft2022.shp',
                     fid_column_name = "OBJECTID") %>%
  st_drop_geometry() # only care about data not geometry

WQSlookupFull <- left_join(WQSlookupToDo, lacustrine, by = 'WQS_ID') %>%
  filter(!is.na(CLASS)) %>% # drop sites that didn't actually join 
  bind_rows(WQSlookupFull) # add these to the larger dataset
rm(lacustrine) # clean up workspace

estuarineLines <- st_read('../GIS/draft_WQS_shapefiles_2022/estuarinelines_draft2022.shp',
                          fid_column_name = "OBJECTID") %>%
  st_drop_geometry() # only care about data not geometry

WQSlookupFull <- left_join(WQSlookupToDo, estuarineLines, by = 'WQS_ID') %>%
  filter(!is.na(CLASS)) %>% # drop sites that didn't actually join 
  bind_rows(WQSlookupFull) # add these to the larger dataset
rm(estuarineLines) # clean up workspace

estuarinePolys <- st_read('../GIS/draft_WQS_shapefiles_2022/estuarinepolygons_draft2022.shp', 
                          fid_column_name = "OBJECTID") %>%
  st_drop_geometry() # only care about data not geometry

WQSlookupFull <- left_join(WQSlookupToDo, estuarinePolys, by = 'WQS_ID') %>%
  filter(!is.na(CLASS)) %>% # drop sites that didn't actually join 
  bind_rows(WQSlookupFull) # add these to the larger dataset
rm(estuarinePolys) # clean up workspace

WQSlookup_withStandards_pin <- bind_rows(WQSlookup_withStandards, WQSlookupFull) # for pin

The WQSlookup_withStandards_pin object is then pinned to the R server, where it is sourced by numerous other data users and products. The pin can be retrieved with the following script.

WQSlookup_withStandards <- pin_get('ejones/WQSlookup-withStandards', board = 'rsconnect')

3.2.4.1.2 Adding new AU data to pinned datasets

For new stations in an assessment cycle that were manually reviewed using the Regional Metadata Validation Tool, the resulting AU information must be added to the existing AUlookup table pin stored on the R server. After an assessor adds these stations to WQA CEDS with AU information, this table is no longer used to source AU information for a station, but it is required at the start of an assessment cycle for all stations that were not included in any previous assessment.

The server administrator must pull the AU attribution information contained in the Metadata App from the R server and append this information to the existing AUlookup table pin. The data may be retrieved by anyone using the following script.

AUlookupArchive <- pin_get('ejones/AUlookup', board = 'rsconnect')

3.2.4.2 Create a stationsTablebegin dataset

The so-called ‘stationsTablebegin’ dataset is a key input to the automated assessment scripts: it tells the scripts which stations to assess with the automated assessment functions. This dataset must serve as the list of stations to assess, rather than any single input dataset (e.g. conventionals, citmon/nonagency, PCB, etc.), because stations that need to be reviewed during a given assessment period might not appear in any one of them. It also contains any stations from a previous cycle that must be carried over to the new cycle, i.e. carryover stations. These carryover stations may not have any data in the current window and thus would never appear in any individual dataset organized so far. The stationsTablebegin dataset therefore contains not only all the stations that need to be addressed in an assessment cycle, but also at least one Assessment Unit per station for organizational purposes.

Because station WQS information is only used for comparing individual parameters to criteria, we do not actually store WQS metadata in the stationsTablebegin dataset. This information is joined to the conventionals dataset at a later step in the automated assessment process to keep the dataset as close as possible to the CEDS bulk upload template.
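That later join can be sketched as follows, assuming the pinned WQSlookup-withStandards naming used elsewhere in this manual; the exact join columns are illustrative.

```r
library(dplyr)
library(pins)

# pull the master WQS metadata pin
WQSlookup_withStandards <- pin_get('ejones/WQSlookup-withStandards',
                                   board = 'rsconnect')

# attach WQS metadata to each monitoring record by station;
# conventionals uses FDT_STA_ID where the lookup uses StationID
conventionalsWithWQS <- left_join(conventionals, WQSlookup_withStandards,
                                  by = c('FDT_STA_ID' = 'StationID'))
```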

Starting with the conventionals_distinct dataset (which includes citmon/non agency data at this point), we can begin to create our dataset of all the unique stations that we need to organize for the current cycle (stationsTablebegin).

# conventionals_distinct <- conventionals %>%
#   distinct(FDT_STA_ID, .keep_all = T) %>%
#   # remove any data to avoid confusion
#   dplyr::select(FDT_STA_ID:FDT_COMMENT, Latitude:Data_Source) %>%
#   filter(!is.na(FDT_STA_ID))
conventionals_distinct <- pin_get('ejones/conventionals2024_distinctdraft', board = 'rsconnect') %>% 
  filter(!is.na(Latitude) & !is.na(Longitude)) # need both coordinates to make a spatial object below


# and make a spatial version
conventionals_sf <- conventionals_distinct %>%
  st_as_sf(coords = c("Longitude", "Latitude"), 
               remove = F, # don't remove these lat/lon cols from df
               crs = 4326) # add projection, needs to be geographic for now bc entering lat/lng

We want the stationsTablebegin dataset to look like our CEDS Bulk Upload template to make future steps easier.

The biggest changes from the older stations table format to this new bulk upload template are the addition of the ten ID305B and station type columns as well as the lacustrine designation. The lacustrine designation field identifies whether or not a lake station is considered to be within the “Lacustrine Zone” of the lake. This information is used to assess nutrients in Section 187 lakes.

stationsTemplate <- read_excel('data/WQA_CEDS_templates/WQA_Bulk_Station_Upload_Final.xlsx',
                               sheet = 'Stations', col_types = "text")[0,] # just want structure and not the draft data, force everything to character for now

We want to populate as much of the bulk upload template as possible for each station to make the regional assessors’ lives easier. We will start by filling in as much as we can about each station that had an entry in the last cycle. We will pull this information from the WQA area of ODS. You need ODS access to retrieve this information yourself. Later steps will allow you to source the output of these query/manipulation steps if you do not have ODS access.

3.2.4.2.1 Last cycle AU information

Let’s start by populating this template with information we can grab from the previous assessment cycle stations table information.

Connect to the ODS production environment.

library(pool)
library(dbplyr)
### Production Environment
pool <- dbPool(
  drv = odbc::odbc(),
  Driver = "ODBC Driver 11 for SQL Server", # or whatever ODBC driver version you have installed locally
  Server = "DEQ-SQLODS-PROD,50000",
  Database = "ODS", # the ODBC connection string keyword is 'Database'
  trusted_connection = "yes"
)
3.2.4.2.2 Carry forward any stations required from last cycle

Find all stations from the last IR cycle that should be carried forward for review this cycle: either stations impaired last time or those whose comment field contains the “carr%” string (e.g. carry over, carried over, etc.). Start by querying stations in IR2022 and join in their AU information and station parameters. The key to these joins is the ‘WXA_STATION_DETAIL_ID’/‘Station Detail Id’ field.

# data prep
stations <- pool %>% tbl(in_schema('wqa', "Wqa_Station_Details_View")) %>%
  filter(WSD_CYCLE == 2022) %>% 
  as_tibble() %>% 
  distinct(WXA_STATION_DETAIL_ID, .keep_all = T) %>%  # distinct on this variable for AU join or duplicated rows for stations
  filter(WSD_STATION_ID != '4AGSE013.78') # problem site OIS needs to deal with

# WQA geospatial data only seems to have citmon/non agency station locations. All other stations (DEQ) need to be queried from the WQM side of ODS
stationsGeospatial_wqa <- pool %>% tbl(in_schema('wqa', 'Wqa_Stations_Geospatial_Data_View')) %>%
  filter(Station_Id %in% !! stations$STA_NAME) %>% #stations$WSD_STATION_ID) %>%
  as_tibble() %>% 
  dplyr::select(Station_Id, Latitude, Longitude) 

# WQM geospatial data for DEQ stations
stationsGeospatial_wqm <- pool %>% tbl(in_schema('wqm', 'WQM_Sta_GIS_View')) %>% 
  filter(Station_Id %in% !! stations$WSD_STATION_ID) %>% 
  as_tibble() %>% 
  dplyr::select(Station_Id, Latitude, Longitude) 

# combine geospatial data into one object for easier joining
stationsGeospatial <- bind_rows(stationsGeospatial_wqa, stationsGeospatial_wqm) %>% 
  distinct(Station_Id, .keep_all = T) %>% 
  mutate(Station_Id = case_when(Station_Id == 'Griggs Pond' ~ toupper(Station_Id),
                                Station_Id == 'Sims Metal 003' ~ 'SIMS METAL 003',
                                TRUE ~ as.character(Station_Id))) # Joining problems later if we don't capitalize the names Griggs Pond and Simms Metal 003 as they are elsewhere in ODS

# Query AU information
AU <- pool %>% tbl(in_schema('wqa', '[WQA 305b]')) %>% # **note** need to put views with spaces in name in brackets for SQL to work
  filter(`Station Detail Id` %in% !! stations$WXA_STATION_DETAIL_ID) %>% 
  as_tibble()

stationDetails <- pool %>% tbl(in_schema('wqa', '[WQA Station Parameters Pivot]')) %>% # **note** need to put views with spaces in name in brackets for SQL to work
  filter(`Station Detail Id` %in% !! stations$WXA_STATION_DETAIL_ID) %>% 
  as_tibble()

stationType <- pool %>% tbl(in_schema('wqa', '[WQA Station Detail Types]')) %>% # **note** need to put views with spaces in name in brackets for SQL to work
  filter(`Station Detail Id` %in% !! stations$WXA_STATION_DETAIL_ID) %>% 
  as_tibble()

stationsAU <- left_join(stations, AU, by = c('WXA_STATION_DETAIL_ID' = 'Station Detail Id')) %>% 
  left_join(stationType, by =  c('WXA_STATION_DETAIL_ID' = 'Station Detail Id')) %>% 
  left_join(stationsGeospatial, by = c('STA_NAME' = 'Station_Id')) %>% # Make sure you join on STA_NAME and not WSD_STATION_ID here!!!
  left_join(stationDetails,  by = c('WXA_STATION_DETAIL_ID' = 'Station Detail Id')) 

# actual analysis, find all stations with impaired parameters
impairedStations <- stationsAU %>% 
  filter_at(.vars = vars(contains("Status Code")),
            .vars_predicate = any_vars(str_detect(., 'IM')))

# Cast a wide net: string search for any stations that were carried over from previous cycles by looking for 
#  variations of the phrase "Carry" in the station comment field
carryoverStations <- stationsAU %>% 
  filter(str_detect(str_to_lower(WSD_COMMENTS), 'carr')) # lowercase first so 'Carry', 'carried', etc. all match

# combine lists and remove duplicates
stationsFromLastCycle <- bind_rows(impairedStations, carryoverStations) %>% 
  distinct(WXA_STATION_DETAIL_ID, .keep_all = T)

# clean up workspace
rm(list = c('stations','stationsAU', 'stationDetails', 'impairedStations', 'carryoverStations', 
            'AU', 'stationType', 'stationsGeospatial_wqa', 'stationsGeospatial_wqm'))

Now we need to clean up these data to match the bulk upload data template. We will also strip out the data from previous cycles to avoid cycle-to-cycle confusion.

stationsTable2024begin <- stationsFromLastCycle %>% 
  dplyr::select(STATION_ID = STA_NAME,  #### OR COULD USE WSD_STATION_ID
                ID305B_1 = `ID305B 1`,
                ID305B_2 = `ID305B 2`,
                ID305B_3 = `ID305B 3`,
                ID305B_4 = `ID305B 4`,
                ID305B_5 = `ID305B 5`,
                ID305B_6 = `ID305B 6`,
                ID305B_7 = `ID305B 7`,
                ID305B_8 = `ID305B 8`,
                ID305B_9 = `ID305B 9`,
                ID305B_10 = `ID305B 10`,
                WATER_TYPE = WWT_WATER_TYPE_DESC,
                SALINITY = WSC_DESCRIPTION,
                LACUSTRINE = WSD_LAC_ZONE_YN,
                REGION = STA_REGION,
                TYPE_1 = `Station Type 1`,
                TYPE_2 = `Station Type 2`,
                TYPE_3 = `Station Type 3`,
                TYPE_4 = `Station Type 4`,
                TYPE_5 = `Station Type 5`,
                TYPE_6 = `Station Type 6`,
                TYPE_7 = `Station Type 7`,
                TYPE_8 = `Station Type 8`,
                TYPE_9 = `Station Type 9`,
                TYPE_10 = `Station Type 10`,
                LATITUDE = Latitude,
                LONGITUDE = Longitude,
                WATERSHED = STA_WATERSHED,
                VAHU6 = STA_VA_HU6) %>% 
  mutate(REGION = case_when(REGION == 'NVRO' ~ 'NRO',
                            REGION == 'WCRO' ~ 'BRRO',
                            TRUE~ as.character(REGION))) %>% 
  dplyr::select(any_of(names(stationsTemplate)))

stationsTable2024begin <- bind_rows(stationsTemplate %>% 
                                      mutate(LATITUDE = as.numeric(LATITUDE),
                                             LONGITUDE = as.numeric(LONGITUDE)), 
                                    stationsTable2024begin) # add back in missing columns

Quick QA check for any missing geospatial data.

missingGeospatial <- filter(stationsTable2024begin, is.na(LATITUDE) | is.na(LONGITUDE))

# clean up workspace
rm(list = c('stationsFromLastCycle','missingGeospatial'))
3.2.4.2.3 AU information from Last Cycle

Now we need to get the same metadata for all the stations from the conventionals (and citmon/non agency) dataset from the current cycle. We can make our lives easier by only doing this work for new stations from the conventionals dataset (i.e. dropping all stations from our “to do list” that already have this information from our last step). This chunk will also reformat the queried data into the bulk upload template format so it can be combined with the stationsTable2024begin dataset created above.

stationsToDo <- filter(conventionals_distinct, ! FDT_STA_ID %in% stationsTable2024begin$STATION_ID)

# use the same method from above with a new station list
stations <- pool %>% tbl(in_schema('wqa', "Wqa_Station_Details_View")) %>%
  filter(WSD_STATION_ID %in% !! stationsToDo$FDT_STA_ID) %>% # pull all data from stations identified above
  as_tibble() %>% # get that data local before doing more complicated things than SQL wants to handle
  # keep only the most recent record for each station by grouping and then filtering
  group_by(WSD_STATION_ID) %>% 
  filter(WSD_CYCLE == max(WSD_CYCLE )) %>% 
  distinct(WXA_STATION_DETAIL_ID, .keep_all = T) %>%  # still need only 1 record per site
  ungroup() # ungroup so the WSD_STATION_ID column doesn't come along for the ride to future steps where not necessary
  
# WQA geospatial data only seems to have citmon/non agency station locations. All other stations (DEQ) need to be queried from the WQM side of ODS
stationsGeospatial_wqa <- pool %>% tbl(in_schema('wqa', 'Wqa_Stations_Geospatial_Data_View')) %>%
  filter(Station_Id %in% !! stations$STA_NAME) %>% #stations$WSD_STATION_ID) %>%
  as_tibble() %>% 
  dplyr::select(Station_Id, Latitude, Longitude) 

# WQM geospatial data for DEQ stations
stationsGeospatial_wqm <- pool %>% tbl(in_schema('wqm', 'WQM_Sta_GIS_View')) %>% 
  filter(Station_Id %in% !! stations$WSD_STATION_ID) %>% 
  as_tibble() %>% 
  dplyr::select(Station_Id, Latitude, Longitude) 

# combine geospatial data into one object for easier joining
stationsGeospatial <- bind_rows(stationsGeospatial_wqa, stationsGeospatial_wqm) %>% 
  distinct(Station_Id, .keep_all = T)

AU <- pool %>% tbl(in_schema('wqa', '[WQA 305b]')) %>% # **note** need to put views with spaces in name in brackets for SQL to work
  filter(`Station Detail Id` %in% !! stations$WXA_STATION_DETAIL_ID) %>% 
  as_tibble()

stationDetails <- pool %>% tbl(in_schema('wqa', '[WQA Station Parameters Pivot]')) %>% # **note** need to put views with spaces in name in brackets for SQL to work
  filter(`Station Detail Id` %in% !! stations$WXA_STATION_DETAIL_ID) %>% 
  as_tibble()

stationType <- pool %>% tbl(in_schema('wqa', '[WQA Station Detail Types]')) %>% # **note** need to put views with spaces in name in brackets for SQL to work
  filter(`Station Detail Id` %in% !! stations$WXA_STATION_DETAIL_ID) %>% 
  as_tibble()

stationsAU <- left_join(stations, AU, by = c('WXA_STATION_DETAIL_ID' = 'Station Detail Id')) %>% 
  left_join(stationType, by =  c('WXA_STATION_DETAIL_ID' = 'Station Detail Id')) %>% 
  left_join(stationsGeospatial, by = c('STA_NAME' = 'Station_Id')) %>% # Make sure you join on STA_NAME and not WSD_STATION_ID here!!!
  left_join(stationDetails,  by = c('WXA_STATION_DETAIL_ID' = 'Station Detail Id')) %>% 
  # reorganize to fit the data template
  dplyr::select(STATION_ID = STA_NAME,  #### OR COULD USE WSD_STATION_ID
                ID305B_1 = `ID305B 1`,
                ID305B_2 = `ID305B 2`,
                ID305B_3 = `ID305B 3`,
                ID305B_4 = `ID305B 4`,
                ID305B_5 = `ID305B 5`,
                ID305B_6 = `ID305B 6`,
                ID305B_7 = `ID305B 7`,
                ID305B_8 = `ID305B 8`,
                ID305B_9 = `ID305B 9`,
                ID305B_10 = `ID305B 10`,
                WATER_TYPE = WWT_WATER_TYPE_DESC,
                SALINITY = WSC_DESCRIPTION,
                LACUSTRINE = WSD_LAC_ZONE_YN,
                REGION = STA_REGION,
                TYPE_1 = `Station Type 1`,
                TYPE_2 = `Station Type 2`,
                TYPE_3 = `Station Type 3`,
                TYPE_4 = `Station Type 4`,
                TYPE_5 = `Station Type 5`,
                TYPE_6 = `Station Type 6`,
                TYPE_7 = `Station Type 7`,
                TYPE_8 = `Station Type 8`,
                TYPE_9 = `Station Type 9`,
                TYPE_10 = `Station Type 10`,
                LATITUDE = Latitude,
                LONGITUDE = Longitude,
                WATERSHED = STA_WATERSHED,
                VAHU6 = STA_VA_HU6) %>% 
  mutate(REGION = case_when(REGION == 'NVRO' ~ 'NRO',
                            REGION == 'WCRO' ~ 'BRRO',
                            TRUE~ as.character(REGION))) %>% 
  dplyr::select(any_of(names(stationsTemplate)))

# Smash in with the rest of the sites already organized
stationsTable2024begin <- bind_rows(stationsTable2024begin,
                                    stationsAU) 

# clean up workspace
rm(list = c('stations', 'stationDetails', 'stationsAU',
            'AU', 'stationType', 'stationsGeospatial_wqa', 'stationsGeospatial_wqm'))
3.2.4.2.4 AU information for New Stations

So what stations do we have left? These are stations that are in the conventionals dataset but don’t have any historical records in CEDS WQA. We will reorganize them into the template format that we need and add AU information from the pinned data on the R server.

stationsToDo <- filter(conventionals_distinct, ! FDT_STA_ID %in% stationsTable2024begin$STATION_ID) %>% 
  # glean what we can to populate the data template
  dplyr::select(STATION_ID = FDT_STA_ID,
                WATER_TYPE = STA_LV1_CODE,
                TYPE_1 = STA_LV2_CODE,
                LATITUDE = Latitude,
                LONGITUDE = Longitude) %>% 
  mutate(# throw in a flag for swamp but still need to force it to a waterbody type the assessment tools understand
         TYPE_1 =  case_when(WATER_TYPE == 'SWAMP' ~ paste(TYPE_1, 'SWAMP', sep = ': '),
                             TRUE~ as.character(TYPE_1)),
         WATER_TYPE = ifelse(WATER_TYPE == 'SWAMP', NA, WATER_TYPE)) %>% 
  dplyr::select(any_of(names(stationsTemplate))) %>% 
  # Spatially join info we don't trust from CEDS WQM, first turn this into a spatial object
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), 
               remove = F, # don't remove these lat/lon cols from df
               crs = 4326) # add projection, needs to be geographic for now bc entering lat/lng

# get pinned AU data from the R server
AUlookup <- pin_get('ejones/AUlookup', board = 'rsconnect') %>% 
  filter(CYCLE == 2024) # only get sites attributed by assessors for this cycle

# get spatial data from the R server
vahu6 <- st_as_sf(pin_get("ejones/AssessmentRegions_VA84_basins", board = "rsconnect"))
dcr11 <- st_as_sf(pin_get("ejones/dcr11", board = "rsconnect"))

# Spatially join info we don't trust from CEDS WQM
stationsLeft <- st_join(stationsToDo, vahu6) %>% 
  dplyr::select(STATION_ID:LONGITUDE, VAHU6, REGION = ASSESS_REG) %>% 
  st_join(dcr11) %>% 
  dplyr::select(STATION_ID:REGION, WATERSHED = ANCODE) %>% 
  st_drop_geometry() %>% # turn back into tibble
  left_join(dplyr::select(AUlookup, FDT_STA_ID, ID305B_1 ), by = c('STATION_ID' = 'FDT_STA_ID')) # join in AU info from the pinned AUlookup dataset (populated by assessors in metadata attribution app)

# smash into template
stationsTable2024begin <- bind_rows(stationsTable2024begin,
                                     stationsLeft)

# clean up workspace
rm(list = c('stationsLeft', 'stationsToDo', 'stationsTemplate',
            'stationsGeospatial', 'dcr11', 'vahu6', 'AUlookup'))

This dataset is pinned to the R server for anyone to use. You can retrieve it using the following script.

stationsTablebegin <- pin_get('ejones/stationsTable2024begin', board = 'rsconnect')

stationsTablebegin %>% # preview first 50 rows
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

3.2.5 Clean up PCB dataset

IR 2022 had to use fuzzy joining to match each PCB StationID to a real DEQ StationID so the data would appear in the applications. The IR2024 data are still to be determined, so cleanup methods are not yet written.

This data is pinned to server when ready to be sourced by assessment apps/analysts.

3.2.6 Clean up Fish Tissue dataset

The format of this dataset is not yet known; a better system for updating these data is needed. No scripts are included yet because the product changes from cycle to cycle.

This data is pinned to server when ready to be sourced by assessment apps/analysts.

3.2.7 Pin Official Clean IR data to R server

By cleaning up all provided data and then pinning them to the R server, we can easily distribute and archive all data required and sourced by the automated scripts in one secure location. The assessment applications source the pinned data to expedite application rendering time.

The official assessment data pinned to the R server include:

  • Conventionals dataset (including citizen monitoring data)
  • Water Column Metals
  • Sediment Metals
  • VAHU6 spatial data
  • WQS information for each station
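Pinning each cleaned dataset follows the same pattern. The sketch below uses the legacy pins API; the pin name and description are illustrative, not the official names.

```r
library(pins)

# pin the cleaned conventionals dataset
# (pin name and description here are examples, not the official names)
pin(conventionals,
    name = 'conventionals2024final',
    description = 'Official IR2024 conventionals dataset, including citizen monitoring data',
    board = 'rsconnect')
```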

3.3 Automated Assessment

Now that data are prepped for analysis, the last step is to bring all the data together, complete a few more manipulation steps, and feed the data through a process that assesses the data. To minimize the level of coding experience required to run the automated assessment process, the operation is written as a loop instead of as a function. Loops are inherently slow and sometimes confusing, but for this specific use case housing all operations inside a single loop increases the visibility of the order of operations and decreases the amount of coding experience users need to understand what is happening when. All individual parameter analyses are programmed as efficient functions and are detailed in the Individual Parameter Analyses section.
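Conceptually, the loop walks the stations table one station at a time. The sketch below uses hypothetical function and object names (assessAllParameters, stationTableResults); the real loop runs many individual parameter analyses per station.

```r
library(dplyr)

# stationTableResults accumulates one row of assessment results per station
stationTableResults <- tibble()

for (i in 1:nrow(stationTable)) {
  # pull all monitoring data for this station from the prepped datasets
  stationData <- filter(conventionals, FDT_STA_ID == stationTable$STATION_ID[i])

  # run the individual parameter analysis functions
  # (assessAllParameters is a hypothetical wrapper name for illustration)
  stationResults <- assessAllParameters(stationData, stationTable[i, ])

  stationTableResults <- bind_rows(stationTableResults, stationResults)
}
```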

As always, first set up your local environment with all the necessary packages you will use, source any necessary scripts, and connect to the R server to pull pinned data.

The custom assessment functions brought in below are stored in separate .R files and “sourced” into this script. This is a best practice that makes the custom assessment functions easier to maintain and minimizes the amount of code in this Rmarkdown document. When you source the scripts, you will see all the functions in your Environment pane. This differs from loading a package, whose functions do not appear in the Environment pane. DEQ is working towards publishing an automated assessment package to neatly distribute all these custom functions. Until then, you can download these sourced scripts here. You can download the latest working copy of the automated assessment workflow in a single Rmarkdown document here.

An important feature of this script is in the first line of code: options(digits = 12). This sets our local environment options and tells R to use 12 digits when printing numeric information. It is critical to retain all available digits of numeric data so that intermediate rounding steps do not skew results. Steps like these allow for statewide consistency; spreadsheet and database software do not easily allow this level of calculation control. Note that some tidyverse default printing behavior might make it appear that numeric data are rounded to only three digits. This is a feature for fast and clear data printing and does not reflect the true precision of the numeric data stored in your environment after setting options(digits = 12). You can always double check the numeric data by printing the full vector (e.g. datasetIamWorkingWith$numericalColumnIwantToInvestigateFurther).
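A quick illustration of the printing behavior described above (exact tibble display width may vary by pillar version):

```r
options(digits = 12)

x <- 1/3
print(x)  # 0.333333333333 -- full precision shown at the console

library(tibble)
tbl <- tibble(value = x)
print(tbl)       # tibble printing may truncate the display to a few significant digits
print(tbl$value) # printing the underlying vector shows the full stored precision
```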

options(digits = 12) # critical to seeing all the data

library(tidyverse)
library(sf)
library(readxl)
library(pins)
library(config)
library(EnvStats)
library(lubridate)
library(round) # for correct round to even logic

# Bring in custom assessment functions
source('automatedAssessmentFunctions.R')
source('updatedBacteriaCriteria.R')

# get configuration settings
conn <- config::get("connectionSettings")

# use API key to register board
board_register_rsconnect(key = conn$CONNECT_API_KEY,  
                         server = conn$CONNECT_SERVER)

3.3.1 Bring in required datasets

Bring in pinned data and local data. This section is still being actively written; stay tuned for an update with final IR2024 data.

# official draft data releases for IR 2024
conventionals <- pin_get('ejones/conventionals2024draft', board = 'rsconnect') 
conventionals_distinct <- pin_get('ejones/conventionals2024_distinctdraft', board = 'rsconnect') 

#old will update
stations2020IR <- pin_get("stations2020IR-sf-final", board = "rsconnect")
WQMstationFull <- pin_get("WQM-Station-Full", board = "rsconnect")

# six year assessment window used to filter the biological results below
# (IR2024 example dates shown; confirm against the current cycle's data window)
assessmentPeriod <- as.POSIXct(c('2017-01-01', '2022-12-31'), tz = 'America/New_York')
VSCIresults <- pin_get("VSCIresults", board = "rsconnect") %>%
  filter( between(`Collection Date`, assessmentPeriod[1], assessmentPeriod[2]) )
VCPMI63results <- pin_get("VCPMI63results", board = "rsconnect") %>%
  filter( between(`Collection Date`, assessmentPeriod[1], assessmentPeriod[2]) )
VCPMI65results <- pin_get("VCPMI65results", board = "rsconnect") %>%
  filter( between(`Collection Date`, assessmentPeriod[1], assessmentPeriod[2]) ) 
# placeholders from IR2022 until draft or final IR2024 data are available
WCmetals <- pin_get("WCmetals-2022IRfinal",  board = "rsconnect")
Smetals <- pin_get("Smetals-2022IRfinal",  board = "rsconnect")
markPCB <- read_excel('data/2022 IR PCBDatapull_EVJ.xlsx', sheet = '2022IR Datapull EVJ') # will be pinned data for IR2024
fishPCB <- read_excel('data/FishTissuePCBsMetals_EVJ.xlsx', sheet= 'PCBs') # will be pinned data for IR2024
fishMetals <- read_excel('data/FishTissuePCBsMetals_EVJ.xlsx', sheet= 'Metals') # will be pinned data for IR2024

Bring in the Station Table from the two most recent assessment cycles. The comment field from each of these datasets is joined to the Station Table output to assist regional assessment staff.

stationsTable2022 <- readRDS('../2.organizeMetadata/data/stationsTable2022.RDS')
stationsTable2020 <- readRDS('../2.organizeMetadata/data/stationsTable2020.RDS')

Bring in lake nutrient standards. The lake-specific nutrient criteria are stored separately from the WQS spatial data because these fields do not exist in the WQS spatial layers. Lake names in the WQS spatial layers are notoriously difficult to join against because of all the variation in names. As such, this dataset contains a special joining field, Lake_Name, to clean up joining issues.

lakeNutStandards <- read_csv('exampleData/9VAC25-260-187lakeNutrientStandards.csv')
## Parsed with column specification:
## cols(
##   Lake_Name = col_character(),
##   `Man-made Lake or Reservoir Name` = col_character(),
##   Location = col_character(),
##   `Chlorophyll a (ug/L)` = col_double(),
##   `Total Phosphorus (ug/L)` = col_double()
## )
lakeNutStandards %>% # preview dataset
  DT::datatable(rownames = F, options = list(dom = 'lftip', pageLength = 5, scrollX = TRUE))

Bring in beginning station table data. This information communicates to the scripts which stations should be assessed and where they should be organized (AUs). It also has data from the last cycle to populate the historical station information table in the application.

Note: The AU assignments in this table are valid as of the start of the assessment process. Assessors control any AU splits/reassignments through local (.csv) copies of the output of this script that are uploaded to the relevant Assessment Application on the Connect Server (and subsequently uploaded to CEDS via the bulk data upload tool).

stationTable <- pin_get('ejones/stationsTable2024begin', board = 'rsconnect')

Bring in the Station Table Bulk Upload Template. This is the template format to match so that output from this script can be easily uploaded to CEDS via the Bulk Data Upload tool. Any deviations from this template in the output dataset were specifically requested by regional assessment staff to improve and expedite their review process. They are responsible for removing those added fields prior to bulk data upload.

stationsTemplate <- read_excel('WQA_CEDS_templates/WQA_Bulk_Station_Upload_Final.xlsx',
                               sheet = 'Stations', col_types = "text")[0,] %>% 
  mutate(LATITUDE = as.numeric(LATITUDE), LONGITUDE = as.numeric(LONGITUDE)) %>% 
  mutate_at(vars(contains('_EXC')), as.integer) %>% 
  mutate_at(vars(contains('_SAMP')), as.integer) %>% 
  # new addition that breaks station table upload template but very helpful for assessors
  mutate(BACTERIADECISION = as.character(NA),
         BACTERIASTATS = as.character(NA),
         `Date Last Sampled` = as.character(NA))

3.3.2 Attach WQS information

Pull pinned WQS info saved on server. Then perform a series of data manipulation steps that:

  1. Join WQS to station table by StationID
  2. Create a new variable that will correct for Class II pH differences for Tidal Waters
  3. Join actual WQS criteria (object name WQSvalues) to each StationID
  4. Perform a little data cleanup to lose columns unnecessary for future steps
  5. Join station ecoregion information (for benthic analyses)
  6. Standardize lake names (to match 187 lake names for future joining of criteria)
    • Some problematic stations need manual steps to attach the correct lake name
  7. Join lake nutrient criteria to individual stations
  8. Rearrange data schema to more useful format after all previous joins
  9. Correct units for nutrient criteria to match conventionals data format (ug/L to mg/L)

WQSlookup <- pin_get("WQSlookup-withStandards",  board = "rsconnect")
citmonWQS <- readRDS('C:/HardDriveBackup/R/GitHub/IR2024/1.preprocessData/data/citmonStationsWithWQS.RDS') %>% 
  dplyr::select(StationID = `Station Id`, `WQS Section`, `WQS Class`, `WQS Special Standard`)

stationTable <- stationTable %>% # (1)
  
  # Special CitMon/Non Agency step until full WQS_ID implementation in IR2028
  left_join(citmonWQS, by = c('STATION_ID' = 'StationID')) %>% # (1)
  
  # Join to real WQS_ID's (do this second in case citmon station double listed, want proper WQS_ID if available) (1)
  left_join(WQSlookup, by = c('STATION_ID' = 'StationID')) %>%
  
  # coalesce these similar fields together, taking WQS_ID info before citmon method
  mutate(CLASS = coalesce(CLASS, `WQS Class`),
         SEC = coalesce(SEC, `WQS Section`),
         SPSTDS = coalesce(SPSTDS, `WQS Special Standard`)) %>% 
  dplyr::select(-c(`WQS Section`, `WQS Class`, `WQS Special Standard`)) %>% 
  
  # Fix for Class II Tidal Waters in Chesapeake (bc complicated DO/temp/etc standard)
  mutate(CLASS_BASIN = paste(CLASS,substr(BASIN, 1,1), sep="_")) %>% # (2)
  mutate(CLASS_BASIN = ifelse(CLASS_BASIN == 'II_7', "II_7", as.character(CLASS))) %>% # (2)
  # Join actual WQS criteria to each StationID
  left_join(WQSvalues, by = 'CLASS_BASIN') %>% # (3)
  # data cleanup
  dplyr::select(-c(CLASS.y,CLASS_BASIN)) %>% # (4)
  rename('CLASS' = 'CLASS.x') %>% # (4)
  # Join station ecoregion information (for benthic analyses)
  left_join(dplyr::select(WQMstationFull, WQM_STA_ID, EPA_ECO_US_L3CODE, EPA_ECO_US_L3NAME) %>% #(5)
              distinct(WQM_STA_ID, .keep_all = TRUE), by = c('STATION_ID' = 'WQM_STA_ID')) %>%
  # Standardize lake names (to match 187 lake names for future joining of criteria)
  lakeNameStandardization() %>% # standardize lake names (6)
  
   
  # extra special step (6)
  mutate(Lake_Name = case_when(STATION_ID %in% c('2-TRH000.40') ~ 'Thrashers Creek Reservoir',
                               STATION_ID %in% c('2-LSL000.16') ~ 'Lone Star Lake F (Crystal Lake)',
                               STATION_ID %in% c('2-LSL000.04') ~ 'Lone Star Lake G (Crane Lake)',
                               STATION_ID %in% c('2-LSL000.20') ~ 'Lone Star Lake I (Butler Lake)',
                               STATION_ID %in% c('2-NWB002.93','2-NWB004.67',
                                                 '2-NWB006.06') ~ 'Western Branch Reservoir',
                               STATION_ID %in% c('2-LDJ000.60') ~ 'Lake Nottoway (Lee Lake)',
                               TRUE ~ as.character(Lake_Name))) %>%


  # Join lake nutrient criteria to individual stations
  left_join(lakeNutStandards %>% 
              mutate(Lakes_187B = 'y'),  # special step to make sure the 187 lakes WQS designation is correct even when not flagged upstream
            by = c('Lake_Name')) %>% # (7)
  # lake drummond special standards
  mutate(Lakes_187B = ifelse(is.na(Lakes_187B.y ), Lakes_187B.x, Lakes_187B.y), 
         # dd. For Lake Drummond, located within the boundaries of Chesapeake and Suffolk in the Great Dismal Swamp, chlorophyll a shall not exceed 35 µg/L and total phosphorus shall not exceed 40 µg/L at a depth of one meter or less. https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section310/
         `Chlorophyll a (ug/L)` = case_when(Lake_Name %in% c('Lake Drummond') ~ 35,
                                            TRUE ~ as.numeric(`Chlorophyll a (ug/L)`)),
         `Total Phosphorus (ug/L)` = case_when(Lake_Name %in% c('Lake Drummond') ~ 40,
                                               TRUE ~ as.numeric(`Total Phosphorus (ug/L)`))) %>% # (7)
  dplyr::select(STATION_ID:StreamType, Lakes_187B, `Description Of Waters`:`Total Phosphorus (ug/L)`) %>% # (8)
  # match lake limit to TP data unit
  mutate(`Total Phosphorus (mg/L)` = `Total Phosphorus (ug/L)` / 1000) # (9)


# Identify stations that are missing WQS
missingWQS <- filter(stationTable, is.na(CLASS))

3.3.3 Identify Station Type

It is important to have a lookup table of all lake stations so the computationally intensive lake methods run only on lake stations. This next step identifies all stations that have lake AU designations. We will use this information later to determine how to analyze certain parameters whose methods differ by waterbody type.

lakeStations <- filter_at(stationTable,
                          .vars = vars(contains("ID305B")),
                          .vars_predicate = any_vars(str_detect(., 'L_')))
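The same lookup can be sketched without the superseded `filter_at()`/`any_vars()` verbs; with current dplyr a roughly equivalent call is `filter(stationTable, if_any(contains("ID305B"), ~ str_detect(., 'L_')))`. Below is a base R sketch on a hypothetical `toyStations` table (the station and AU IDs are made up, following the pattern that lake assessment units carry `L_` in their ID305B codes):

```r
# Base R sketch of the lake-station lookup; toyStations and its IDs are
# hypothetical examples, not real DEQ stations
toyStations <- data.frame(
  STATION_ID = c("2-XXX000.01", "2-XXX000.02", "2-XXX000.03"),
  ID305B_1   = c("VAW-I03R_XXX01A00", "VAN-A09L_XXX01A02", NA),
  ID305B_2   = c(NA, NA, "VAC-L07L_XXX01A10"),
  stringsAsFactors = FALSE
)

idCols <- grep("ID305B", names(toyStations), value = TRUE)  # every ID305B* column
isLake <- apply(toyStations[idCols], 1,
                function(x) any(grepl("L_", x)))            # TRUE if any AU ID contains 'L_'
lakeStationsToy <- toyStations[isLake, ]                    # rows 2 and 3 in this example
```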

3.3.4 Bring in Low Flow Information

We also need to bring in the previously analyzed low flow periods for the assessment window. These USGS stream gages are analyzed using the 7Q10 methodology, consistent with the DEQ Water Permitting Program, with two changes in assumptions:

  1. Provisional data are accepted (due to the expedited assessment timeline). QA/QC changes from Provisional to Accepted data generally involve storm event corrections, so using Provisional data should not affect low flow analyses.
  2. Gages analyzed for low flow statistics require 10 or more valid water years for inclusion in analyses.

This gage-specific information is spatially joined to major river subbasins to extrapolate low flow conditions to stations within those watersheds. The temporal windows where low flow events were recorded in a watershed are joined to the stations that fall within it. Lake stations are removed from this flag. Parameters that should not be analyzed during low flow events are flagged during the automated assessment process and presented to the assessment staff in the assessment applications. A station-level flag is presented in the stationTable output of these scripts for users who prefer to use only the tabular output from the automated assessment tools.
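For context, the 7Q10 statistic is the lowest 7 day average flow expected to recur once in ten years. The sketch below illustrates the idea in base R on a hypothetical daily flow series, using an empirical 10% quantile of the annual 7 day minima in place of the log-Pearson Type III fit typically used in practice; it is not the DEQ implementation, and real analyses also use water years rather than calendar years.

```r
# Hypothetical sketch of a 7Q10-style statistic: lowest 7-day mean flow per
# year, then the flow with a 10-year recurrence (10% annual probability)
set.seed(1)
dates <- seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day")
flow  <- exp(rnorm(length(dates), mean = 4, sd = 0.6))        # fake daily flow (cfs)
sevenDayMean <- stats::filter(flow, rep(1/7, 7), sides = 1)   # trailing 7-day mean
annualMin <- tapply(as.numeric(sevenDayMean),
                    format(dates, "%Y"), min, na.rm = TRUE)   # minimum per calendar year
sevenQ10 <- unname(quantile(annualMin, probs = 0.10))         # empirical 10% quantile
```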

Next, join the basin-level low flow flags to the draft conventionals data by sample date and basin.

# Simplify the data first by Date and BASIN_CODE so as not to duplicate data when joined to the conventionals raw data
assessmentWindowLowFlowsBasinLevelFlag <- dplyr::select(assessmentWindowLowFlows, Date, `Gage ID`, `7Q10 Flag`, BASIN_CODE) %>% 
  group_by(Date, BASIN_CODE) %>% 
  summarise(`7Q10 Flag Gage` = paste(unique(`Gage ID`),collapse = ", "),
            `7Q10 Flag` = `7Q10 Flag`) %>% 
  # still have duplicate rows with same info, so need to drop those or will duplicate conventionals data when joined in
  mutate(n = 1:n()) %>% 
  filter(n == 1) %>% 
  dplyr::select(-n) # get rid of count bc no longer needed


conventionals <- conventionals %>% 
  left_join(dplyr::select(conventionals_distinct, FDT_STA_ID, BASIN_CODE) %>% st_drop_geometry(),
            by = 'FDT_STA_ID') %>% 
  mutate(SampleDate = as.Date(FDT_DATE_TIME)) %>% 
  #left_join(dplyr::select(stationTable, STATION_ID, VAHU6), by = c('FDT_STA_ID' = 'STATION_ID')) %>% # join in VAHU6 information from Assessors
  #left_join(dplyr::select(subbasinToVAHU6, VAHU6, BASIN_CODE, Basin_Code), by = 'VAHU6')  # join in basin information by VAHU6
  left_join(assessmentWindowLowFlowsBasinLevelFlag,
            by = c('SampleDate' = 'Date', 'BASIN_CODE' = 'BASIN_CODE')) %>% 
  dplyr::select(-SampleDate) # rm unnecessary column



# Special step to remove low flow information from lake stations
conventionals <- conventionals %>% 
  mutate(`7Q10 Flag` = case_when(FDT_STA_ID %in% lakeStations$STATION_ID ~ NA_character_,
                                 TRUE ~ `7Q10 Flag`))

#View(filter(conventionals, `7Q10 Flag` == "7Q10 Flag"))

3.4 Automated Assessment Analysis

Finally, we are ready to analyze some data! Go through stationTable one station at a time, join conventionals, analyze all available data by each parameter function (or nested functions), and report out results in the WQA CEDS bulk upload template format. The next chunk is long. We start by setting up placeholder objects to store the data we want to save for later (stationTableResults and ammoniaAnalysis). The ammonia calculation takes a bit of time, so we save all the data, analysis, and results in ammoniaAnalysis so we can easily source this information for faster application rendering.

The loop runs through each station, correcting special standards information and running any thermocline analyses as necessary to build a stationData object. This stationData object is fed into all subsequent automated assessment functions that calculate results for individual parameters and output information into the appropriate output format. Certain parameter functions take longer to compute than others, or are sourced multiple times in this chunk, so they are run first, e.g. ammonia, Public Water Supply (PWS) criteria, nutrients, dissolved oxygen daily average criteria, etc. The results object is then constructed parameter by parameter into the Stations Table template format. This dataset is then appended to the overall output dataset, stationTableResults, before moving on to the next station. Should a station in the stationTable object not have any data in conventionals, the appropriate metadata in the Stations Table template format is still sent to the stationTable output object with any previous cycle comments, potentially indicating that the station doesn’t have new data.

Please see the Individual Parameter Analyses section for detailed descriptions of what each individual parameter function is doing during the below analysis.

# make placeholder objects to store data results
stationTableResults <- stationsTemplate 
# save ammonia results (based on default assessment information) for use in app to speed rendering
ammoniaAnalysis <- tibble()


# time the operation so we can know how long it takes:
startTime <- Sys.time()

# loop over all sites, not super efficient but gets the job done in an easy to follow format
for(i in 1:nrow(stationTable)){

  print(paste('Assessing station', i, 'of', nrow(stationTable), sep=' ')) # print progress information

  # pull one station data
  stationData <- filter(conventionals, FDT_STA_ID %in% stationTable$STATION_ID[i]) %>%
    left_join(stationTable, by = c('FDT_STA_ID' = 'STATION_ID')) %>%
    # Special Standards Correction step. This is done on the actual data bc some special standards have temporal components
    pHSpecialStandardsCorrection() %>% # correct pH to special standards where necessary
    temperatureSpecialStandardsCorrection() %>% # correct temperature special standards where necessary
    # special lake steps
    {if(stationTable$STATION_ID[i] %in% lakeStations$STATION_ID)
      suppressWarnings(suppressMessages(
        mutate(., lakeStation = TRUE) %>%
        thermoclineDepth())) # adds thermocline information and SampleDate
      else mutate(., lakeStation = FALSE) }

  # --------------------------------------------------------------------------------------------------------------------------------------
  # Add some extra columns that are helpful for assessors but don't fit the bulk upload template
  
  # Previous station table comments
  comments <- stationTableComments(stationTable$STATION_ID[i], stationsTable2022, '2022', stationsTable2020, '2020')
  
  # Date last sampled
  if(nrow(stationData) > 0){
    dateLastSampled <- as.character(max(stationData$FDT_DATE_TIME)) } else {dateLastSampled <- 'No data in conventionals data pull'}
  
  # Low Flow Summary, only gives flag if station has FDT_TEMP_CELCIUS, DO_mg_L, FDT_FIELD_PH collected on day with gage in subbasin 
  # below 7Q10
  lowFlowSummary <- lowFlowFlagColumn(stationData)
  
  #---------------------------------------------------------------------------------------------------------------------------------------
  
  # Ammonia special section
  ammoniaAnalysisStation <- freshwaterNH3limit(stationData, trout = ifelse(unique(stationData$CLASS) %in% c('V','VI'), TRUE, FALSE),
                                        mussels = TRUE, earlyLife = TRUE) 
  # https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section155/ states the assumption is that
  #  waters are to be assessed with the assumption that mussels and early life stages of fish should be present
  # trout presence is determined by WQS class, this can be changed in the app but is forced to be what the station
  # is attributed to in the automated assessment scripts
  
  # PWS Criteria
  if(nrow(stationData) > 0){
    if(is.na(unique(stationData$PWS))  ){
      PWSconcat <- tibble(PWS= NA)
   } else {
     PWSconcat <- cbind(#tibble(STATION_ID = unique(stationData$FDT_STA_ID)),
                         assessPWS(stationData, NITRATE_mg_L, LEVEL_NITRATE, 10, 'PWS_Nitrate'),
                         assessPWS(stationData, CHLORIDE_mg_L, LEVEL_CHLORIDE, 250, 'PWS_Chloride'),
                         assessPWS(stationData, SULFATE_TOTAL_mg_L, LEVEL_SULFATE_TOTAL, 250, 'PWS_Total_Sulfate')) %>%
       dplyr::select(-ends_with('exceedanceRate')) }
    
  # chloride assessment if data exists
    if(nrow(filter(stationData, !is.na(CHLORIDE_mg_L))) > 0){
      chlorideFreshwater <- rollingWindowSummary(
        annualRollingExceedanceSummary(
          annualRollingExceedanceAnalysis(chlorideFreshwaterAnalysis(stationData), yearsToRoll = 3, aquaticLifeUse = TRUE) ), "CHL")
    } else {chlorideFreshwater <- tibble(CHL_EXC = NA, CHL_STAT= NA)}
    
  # Water toxics combination with PWS, Chloride Freshwater, and water column PCB data
  if(nrow(bind_cols(PWSconcat,
                    chlorideFreshwater,
                    PCBmetalsDataExists(filter(markPCB, str_detect(SampleMedia, 'Water')) %>%
                                        filter(StationID %in% stationData$FDT_STA_ID), 'WAT_TOX')) %>%
          dplyr::select(contains(c('_EXC','_STAT'))) %>%
          mutate(across( everything(),  as.character)) %>%
          pivot_longer(cols = contains(c('_EXC','_STAT')), names_to = 'parameter', values_to = 'values', values_drop_na = TRUE) %>% 
          filter(! str_detect(values, 'WQS info missing from analysis')) %>% 
          filter(! values == "S")) >= 1) {
    WCtoxics <- tibble(WAT_TOX_EXC = NA, WAT_TOX_STAT = 'Review')
    } else { WCtoxics <- tibble(WAT_TOX_EXC = NA, WAT_TOX_STAT = NA)} 
  
  } else {
    WCtoxics <- tibble(WAT_TOX_EXC = NA, WAT_TOX_STAT = NA)
  }
   
  # If data exists for station, run it, otherwise just output what last cycle said and comment
  if(nrow(stationData) > 0){
    # Nutrients based on station type
    # Nutrient: TP (lakes have real standards; riverine no longer uses 0.2 mg/L as an observed effect for Aquatic life use)
    if(unique(stationData$lakeStation) == TRUE){
      TP <- TP_Assessment(stationData) 
      } else {
        TP <- countNutrients(stationData, PHOSPHORUS_mg_L, LEVEL_PHOSPHORUS, NA) %>% quickStats('NUT_TP') %>% 
          mutate(NUT_TP_STAT = ifelse(NUT_TP_STAT != "S", "Review", NA)) } # flag OE but don't show a real assessment decision
    
      # Nutrients: Chl a (lakes)
    if(unique(stationData$lakeStation) == TRUE){
      chla <- chlA_Assessment(stationData)
        #tibble(NUT_CHLA_EXC = NA, NUT_CHLA_SAMP = NA, NUT_CHLA_STAT = NA) # placeholder for now
      } else {
        chla <- countNutrients(stationData, CHLOROPHYLL_A_ug_L, LEVEL_CHLOROPHYLL_A, NA) %>% quickStats('NUT_CHLA') %>%
          mutate(NUT_CHLA_STAT = NA) } # don't show a real assessment decision
    
    # Run DO Daily Avg for everyone!
      DO_Daily_Avg_STAT <- paste0('DO_Daily_Avg_STAT: ', 
                                       DO_Assessment_DailyAvg(stationData) %>% 
                                         quickStats('DO_Daily_Avg') %>% 
                                         dplyr::select(DO_Daily_Avg_STAT) %>% pull())#}
    
    results <- cbind(
      StationTableStartingData(stationData),
      tempExceedances(stationData) %>% quickStats('TEMP'),
      DOExceedances_Min(stationData) %>% quickStats('DO'), 
      # this will be removed for lake stations later since it does not apply
      pHExceedances(stationData) %>% quickStats('PH'),
      bacteriaAssessmentDecisionClass( # NEW for IR2024, bacteria only assessed in two most recent years of assessment period
        filter(stationData, between(FDT_DATE_TIME, assessmentPeriod[1] + years(4), assessmentPeriod[2])),
        uniqueStationName = unique(stationData$FDT_STA_ID)),
 
      # Ammonia 
      rollingWindowSummary(
        annualRollingExceedanceSummary(
          annualRollingExceedanceAnalysis(ammoniaAnalysisStation, yearsToRoll = 3, aquaticLifeUse = FALSE)), 
        parameterAbbreviation = "AMMONIA"),
    
      # Flag for whether or not metals data exists
      metalsData(filter(WCmetals, Station_Id %in% stationData$FDT_STA_ID), 'WAT_MET'),
   
      # Mark's water column PCB results, flagged
      WCtoxics, # from above, adds in PWS and water column PCB information
    
      # Roger's sediment metals analysis, transcribed
      metalsData(filter(Smetals, Station_Id %in% stationData$FDT_STA_ID), 'SED_MET'),
      
      # Mark's sediment PCB results, flagged
      PCBmetalsDataExists(filter(markPCB, str_detect(SampleMedia, 'Sediment')) %>%
                      filter(StationID %in% stationData$FDT_STA_ID), 'SED_TOX'),
      
      # Gabe's fish metals results, flagged
      PCBmetalsDataExists(filter(fishMetals, Station_ID %in% stationData$FDT_STA_ID), 'FISH_MET'),
      
      # Gabe's fish PCB results, flagged
      PCBmetalsDataExists(filter(fishPCB, `DEQ rivermile` %in%  stationData$FDT_STA_ID), 'FISH_TOX'),
    
      # Benthics 
      benthicAssessment(stationData, VSCIresults),
      
      # Nutrient Assessment done above by waterbody type
      TP,
      chla) %>%
    
      # COMMENTS
      mutate(COMMENTS = paste0(DO_Daily_Avg_STAT) ) %>% 
      left_join(comments, by = 'STATION_ID') %>%
      left_join(lowFlowSummary, by = 'STATION_ID') %>%
      dplyr::select(-ends_with(c('exceedanceRate', 'VERBOSE', 'Assessment Decision', 'StationID'))) %>%  # to match Bulk Upload template but helpful to keep visible til now for testing
      mutate(`Date Last Sampled` = dateLastSampled)
  } else {# pull what you can from last cycle and flag as carry over
    results <- filter(stationTable, STATION_ID == stationTable$STATION_ID[i]) %>%
      dplyr::select(STATION_ID:VAHU6, COMMENTS) %>%
      mutate(COMMENTS = 'This station has no data in current window but was carried over due to IM in one of the 2020IR status fields or the 2020 stations table reports the station was carried over from a previous cycle.') %>%#,
             #BACTERIA_COMMENTS = NA) %>%
      left_join(comments, by = 'STATION_ID') %>% 
      mutate(`Date Last Sampled` = dateLastSampled)
  }
  
  stationTableResults <- bind_rows(stationTableResults,results)
  ammoniaAnalysis <- bind_rows(ammoniaAnalysis, tibble(StationID = unique(stationData$FDT_STA_ID), AmmoniaAnalysis = list(ammoniaAnalysisStation)))
    
}


stationTableResults <- bind_rows(stationsTemplate, stationTableResults) %>%
  # for now bc bacteria needs help still
  dplyr::select(STATION_ID:`7Q10 Flag`)

timeDiff <- Sys.time() - startTime

3.5 Individual Parameter Analyses

The functions detailed in this section are called in either the automated assessment workflow report or by waterbody-specific shiny applications that unpack the output of the automated assessment workflow report. All of the automated assessment functions are stored in a single .R file that is shared among the waterbody-specific shiny assessment applications and Rmarkdown report to minimize the labor involved in maintaining these critical functions. One day, these functions will be published as an R package, but until then, you can access the latest version of this script here.

3.5.1 How to Test Individual Parameter Functions

The following code snippet will allow you to generate the appropriate stationData required for most function arguments. The below snippet will only run if you have run all the required data gathering and manipulation steps identified in the Automated Assessment Analysis section. The chunk below is set up such that you can easily change test stations by updating your station object.

# any station included in the stationTable object can be specified here, whether or not there is data for the station
station <- '1AACO014.57' # other potential examples: '2-JKS023.61', '1ABAR037.84'

# set up stationData object to run through individual parameter functions
stationData <- filter(conventionals, FDT_STA_ID %in% station) %>%
  left_join(stationTable, by = c('FDT_STA_ID' = 'STATION_ID')) %>%
  # Special Standards Correction step. This is done on the actual data bc some special standards have temporal components
  pHSpecialStandardsCorrection() %>% # correct pH to special standards where necessary
  temperatureSpecialStandardsCorrection() %>% # correct temperature special standards where necessary
  # special lake steps
  {if(station %in% lakeStations$STATION_ID)
      suppressWarnings(suppressMessages(
        mutate(., lakeStation = TRUE) %>%
          thermoclineDepth())) # adds thermocline information and SampleDate
      else mutate(., lakeStation = FALSE) }

3.5.1.1 Why would you want to test individual functions?

New for the IR2024 cycle, all data, methods, and results are available for regional assessment staff (or any DEQ staff) to interrogate, for riverine and lacustrine waterbody types and some estuarine methods (stay tuned on improvements on this waterbody type). The agency has been working through technology hurdles to achieve full transparency for assessment processes for a number of years, and while we aren’t fully there yet, we are getting closer. This bookdown is written as an accompanying handbook to all automated assessment processes to decipher the process for beginner R users.

One of the goals of the WQA program is that anyone with access to the assessment data and analytical scripts can run the automated assessment processes and come to the same results. By consistently applying the same rules and logic to assessment data, we can ensure the assessment decisions are universally applied across the state.

Lastly, while we hope the next section, containing plain English narrative descriptions of the automated assessment functions, provides sufficient explanation of what analytical steps were taken and why, should you run into a question, the chunk above and the functions below will allow any user to dig into any function to fully understand each process. We encourage you to investigate these functions in a stepwise manner. Should you run into problems, please reach out to the WQA Technical Support Team.

3.5.2 Temperature

A maximum temperature criterion applies to stations depending on their associated WQS. The tempExceedances() function identifies any temperature exceedances of the Max Temperature (C) field for each sample provided. The function removes all Level I and Level II data and missing data before rounding measures to even and comparing against the provided criteria. An internal 7Q10 Flag is passed through this function for use in the quickStats() function should a user adjust the default drop7Q10 argument. Read more about the quickStats() helper function in the Additional Functions section.

#Max Temperature Exceedance Function
tempExceedances <- function(stationData){
  dplyr::select(stationData, FDT_DATE_TIME, FDT_DEPTH, tidyselect::contains('TEMP_CELCIUS'), `Max Temperature (C)`, `7Q10 Flag`) %>% # Just get relevant columns 
    dplyr::filter(! (LEVEL_FDT_TEMP_CELCIUS %in% c('Level II', 'Level I'))) %>% # get lower levels out
    dplyr::filter(! is.na(FDT_TEMP_CELCIUS)) %>% # get rid of NA's
    # rename columns to make exceedance analyses easier to apply
    dplyr::rename(parameter = !! names(.[3]),
           limit = !! names(.[6])) %>% 
    # Apply Round to Even Rule before testing for exceedances
    dplyr::mutate(parameterRound = signif(parameter, digits = 2), # two significant figures based on WQS https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section50/
                  exceeds = case_when(parameterRound > limit ~ TRUE, # Identify where above max Temperature, 
                                      parameterRound <= limit ~ FALSE, # no exceedance
                                      TRUE ~ NA))
}

# Example Usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
# tempExceedances(stationData) 
# tempExceedances(stationData) %>%
#  quickStats('TEMP') # pass results to quickStats() to consolidate results
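The round to even behavior referenced above comes from base R's signif(), which follows IEC 60559 round-half-to-even; a quick check on exactly representable halves:

```r
# signif() rounds halves to the nearest even digit (IEC 60559), which is the
# "Round to Even Rule" applied before exceedance testing
signif(2.5, digits = 1)    # 2  (rounds down to the even digit)
signif(3.5, digits = 1)    # 4  (rounds up to the even digit)
signif(31.5, digits = 2)   # 32
signif(30.26, digits = 2)  # 30
```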

3.5.3 Dissolved Oxygen

There are two general dissolved oxygen (DO) criteria that may apply to stations, depending on their associated WQS. The minimum DO criterion is evaluated against the provided Dissolved Oxygen Min (mg/L) field using the DOExceedances_Min() function. This function handles lake stations differently than other waterbody types: if a station is in a Section 187 lake, then only the epilimnion and unstratified samples are evaluated. Thermocline information is calculated by the thermoclineDepth() function in a previous step.

Regardless of the waterbody type, all Level I and Level II data and missing data are removed before measures are rounded to even and compared against the provided criteria. An internal 7Q10 Flag is passed through this function for use in the quickStats() function should a user adjust the default drop7Q10 argument. Read more about the quickStats() helper function in the Additional Functions section.

# Minimum DO Exceedance function
DOExceedances_Min <- function(stationData){
  # special step for lake stations, remove samples based on lake assessment guidance 
  if(unique(stationData$lakeStation) == TRUE){
    if(!is.na(unique(stationData$Lakes_187B)) & unique(stationData$Lakes_187B) == 'y'){
      DOdata <- dplyr::filter(stationData, LakeStratification %in% c("Epilimnion", NA)) %>% # only use epilimnion or unstratified samples for analysis
        dplyr::select(FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, DO_mg_L, RMK_DO, LEVEL_DO, 
                      `Dissolved Oxygen Min (mg/L)`, LakeStratification, `7Q10 Flag`) # Just get relevant columns,
    } else {
      DOdata <- dplyr::select(stationData, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, DO_mg_L, RMK_DO, LEVEL_DO,
                              `Dissolved Oxygen Min (mg/L)`, LakeStratification, `7Q10 Flag`) }# Just get relevant columns,
  } else {
    DOdata <- dplyr::select(stationData, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, DO_mg_L, RMK_DO, LEVEL_DO, 
                            `Dissolved Oxygen Min (mg/L)`, `7Q10 Flag`) # Just get relevant columns, 
  }
  
  DOdata %>%
    dplyr::filter(!(LEVEL_DO %in% c('Level II', 'Level I'))) %>% # get lower levels out
    dplyr::filter(!is.na(DO_mg_L)) %>% 
    dplyr::rename(parameter = !!names(.[4]), limit = !!names(.[7])) %>% # rename columns to make functions easier to apply
    # Round to Even Rule
    dplyr::mutate(parameterRound = signif(parameter, digits = 2), # two significant figures based on  https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section50/
                  exceeds = case_when(parameterRound < limit ~ TRUE, # Identify where below min DO 
                                      parameterRound >= limit ~ FALSE, # no exceedance
                                      TRUE ~ NA))
}

# Example Usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
# DOExceedances_Min(stationData) 
# DOExceedances_Min(stationData) %>% 
#   quickStats('DO') # pass results to quickStats() to consolidate results

A separate function called DO_Assessment_DailyAvg() evaluates whether or not a station exceeds the DO daily average criterion, provided in the Dissolved Oxygen Daily Avg (mg/L) field. This analysis is only run if the number of samples for a station on a given date is greater than one. This function handles lake stations differently than other waterbody types. If a station is in a Section 187 lake, then only the epilimnion and unstratified samples are evaluated. Thermocline information is evaluated using the thermoclineDepth() function in a previous step.

Regardless of the waterbody type, all Level I and Level II data and missing data are removed before measures are rounded to even and compared against the provided criteria. An internal 7Q10 Flag is passed through this function for use in the quickStats() function should a user adjust the default drop7Q10 argument. Read more about the quickStats() helper function in the Additional Functions section.

# Daily Average exceedance function
DO_Assessment_DailyAvg <- function(stationData){ 
  # special step for lake stations, remove samples based on lake assessment guidance 
  if(unique(stationData$lakeStation) == TRUE){
    if(!is.na(unique(stationData$Lakes_187B)) & unique(stationData$Lakes_187B) == 'y'){
      DOdata <- dplyr::filter(stationData, LakeStratification %in% c("Epilimnion", NA)) %>% # only use epilimnion or unstratified samples for analysis
        dplyr::select(FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, DO_mg_L, RMK_DO, LEVEL_DO, 
                      `Dissolved Oxygen Daily Avg (mg/L)`, LakeStratification, `7Q10 Flag`) # Just get relevant columns,
    } else {
      DOdata <- dplyr::select(stationData, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, DO_mg_L, RMK_DO, LEVEL_DO,
                              `Dissolved Oxygen Daily Avg (mg/L)`, LakeStratification, `7Q10 Flag`) }# Just get relevant columns,
  } else {
    DOdata <- dplyr::select(stationData, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, DO_mg_L, RMK_DO, LEVEL_DO, 
                            `Dissolved Oxygen Daily Avg (mg/L)`, `7Q10 Flag`) # Just get relevant columns, 
  }
  
  
  DOdata %>% 
    dplyr::filter(!(LEVEL_DO %in% c('Level II', 'Level I'))) %>% # get lower levels out
    dplyr::filter(!is.na(DO_mg_L)) %>% #get rid of NA's
    dplyr::mutate(date = as.Date(FDT_DATE_TIME, format="%m/%d/%Y"), 
                  limit = `Dissolved Oxygen Daily Avg (mg/L)`) %>% 
    dplyr::group_by(date) %>%
    dplyr::mutate(n_Samples_Daily = n()) %>% # how many samples per day?
    dplyr::filter(n_Samples_Daily > 1) %>%
      # Daily average with average rounded to even
    dplyr::mutate(DO_DailyAverage = signif(mean(DO_mg_L), digits = 2),  # two significant figures based on  https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section50/
                  exceeds =  case_when(DO_DailyAverage < limit & is.na(`7Q10 Flag`) ~ TRUE, # exceedance if below limit and no 7Q10 Flag
                                       DO_DailyAverage < limit & `7Q10 Flag` == "7Q10 Flag" ~ FALSE,  # below limit but 7Q10 Flag present, so not counted as an exceedance
                                       DO_DailyAverage >= limit ~ FALSE, # no exceedance
                                       TRUE ~ NA)) %>% 
                    #ifelse(DO_DailyAverage < `Dissolved Oxygen Daily Avg (mg/L)`,T,F)) %>% 
    dplyr::ungroup() %>%
    distinct(date, .keep_all = T) %>% 
    dplyr::select(-c(FDT_DATE_TIME)) 
}

# Example Usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
# DO_Assessment_DailyAvg(stationData)
# DO_Assessment_DailyAvg(stationData) %>% 
#   quickStats('DO_Daily_Avg')  # pass results to quickStats() to consolidate results

3.5.4 pH

A pH range criterion applies to stations depending on their associated WQS. The pHExceedances() function identifies any pH values outside of the range specified in the associated pH Min and pH Max fields. This function handles lake stations differently than other waterbody types. If a station is in a Section 187 lake, then only the epilimnion and unstratified samples are evaluated. Regardless of the waterbody type, the function removes all Level I and Level II data and missing data before rounding measures to even and comparing against the provided criteria. An internal 7Q10 Flag is passed through this function for use in the quickStats() function should a user adjust the default drop7Q10 argument. Read more about the quickStats() helper function in the Additional Functions section.

# pH range Exceedance Function
pHExceedances <- function(stationData){
  # special step for lake stations, remove samples based on lake assessment guidance 
  if(unique(stationData$lakeStation) == TRUE){
    if(!is.na(unique(stationData$Lakes_187B)) & unique(stationData$Lakes_187B) == 'y'){
      pHdata <- filter(stationData, LakeStratification %in% c("Epilimnion", NA)) %>% # only use epilimnion or unstratified samples for analysis
        dplyr::select(FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, FDT_FIELD_PH, RMK_FDT_FIELD_PH, LEVEL_FDT_FIELD_PH, 
                      `pH Min`, `pH Max`, LakeStratification, `7Q10 Flag`) # Just get relevant columns,
    } else {
      pHdata <- dplyr::select(stationData, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, FDT_FIELD_PH, RMK_FDT_FIELD_PH, LEVEL_FDT_FIELD_PH,
                              `pH Min`, `pH Max`, LakeStratification, `7Q10 Flag`) }# Just get relevant columns,
  } else {
    pHdata <- dplyr::select(stationData, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, FDT_FIELD_PH, RMK_FDT_FIELD_PH, LEVEL_FDT_FIELD_PH, 
                            `pH Min`, `pH Max`, `7Q10 Flag`) }# Just get relevant columns, 
  
  pHdata <- filter(pHdata, !(LEVEL_FDT_FIELD_PH %in% c('Level II', 'Level I'))) %>% # get lower levels out
    filter(!is.na(FDT_FIELD_PH)) #get rid of NA's
    
  # only run analysis if WQS exist for station
    if(any(is.na(pHdata$`pH Min`)) | any(is.na(pHdata$`pH Max`))){
      pH <- mutate(pHdata, interval = 1, exceeds = FALSE, limit = `pH Min`) # placeholder to run quickStats() without any WQS
    } else {
        pH <- pHdata %>%
        rowwise() %>% 
        # Round to Even Rule
        mutate(parameterRound = signif(FDT_FIELD_PH, digits = 2)) %>% # two significant figures based on WQS https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section50/
        mutate(interval=findInterval(parameterRound,c(`pH Min`,`pH Max`), left.open=TRUE, rightmost.closed = TRUE)) %>% # Identify where pH outside of assessment range with round to even
        ungroup()%>%
        mutate(exceeds = case_when(interval != 1 ~ TRUE, #  Highlight where pH doesn't fall into criteria range
                                   interval == 1 ~ FALSE,  # no exceedance if inside limit
                                   TRUE ~ NA),
               limit = `pH Min`)
        }
  return(pH)
}

# Example Usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
# pHExceedances(stationData) 
# pHExceedances(stationData) %>% 
#   quickStats('PH')  # pass results to quickStats() to consolidate results
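The in-range test above hinges on findInterval(): with the two criteria bounds supplied as the breakpoint vector, a return value of 1 means the rounded pH falls between pH Min and pH Max, while 0 or 2 is flagged as an exceedance. A standalone illustration of that behavior, using an assumed 6.0-9.0 range rather than any specific station's WQS:

```r
# Standalone illustration of the range test; the 6.0-9.0 bounds are an
# assumed example range, not a particular station's WQS
pHmin <- 6.0
pHmax <- 9.0
# findInterval() reports which interval each value falls in relative to the
# breakpoints: 1 means inside the criteria range, 0 below it, 2 above it
findInterval(c(5.5, 6.5, 7.2, 9.0, 9.4), c(pHmin, pHmax),
             left.open = TRUE, rightmost.closed = TRUE)
# returns 0 1 1 1 2 -> only the first and last values would be exceedances
```

Note that rightmost.closed = TRUE keeps a value exactly at pH Max inside the range (interval 1), matching the function's treatment of the upper bound.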

3.5.5 Bacteria

The recreational bacteria standard is more complicated to apply to monitoring data because of its unique rolling windows. To standardize the logic across E.coli and enterococci data, the nested function architecture that evaluates bacteria data is programmed to handle either parameter. The three functions required to calculate the bacteria criteria are detailed below.
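The rolling window mechanics reduce to simple date arithmetic: each sample date opens its own window spanning that day plus the following 89 days, for 90 calendar days inclusive. A quick standalone sketch (the dates are made up for illustration):

```r
library(lubridate)

# Each sample date opens a 90 day window: the sample day itself plus 89 more
start <- as.Date('2022-06-01')
end   <- start + days(89)
end                           # 2022-08-29, i.e. 90 calendar days inclusive
# A sample collected on 2022-08-29 still falls in this window;
# one collected 2022-08-30 starts a new window instead
as.Date('2022-08-29') <= end  # TRUE
```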

Figure: bacteriaLogic (nested bacteria assessment function architecture)

3.5.5.1 bacteriaExceedances_NEW

The bacteriaExceedances_NEW() function is the innermost function in the nested bacteria function logic; it is the analytical function that applies the bacteria criteria. It takes in all station data; bacteriaField and bacteriaRemark arguments that tell the function which type of bacteria to assess (E.coli or enterococci); a sampleRequirement argument (minimum number of samples in a 90 day window needed to apply the geomean); an STV argument (the statistical threshold value for E.coli or enterococci); and a geomeanCriteria argument (the geomean threshold for E.coli or enterococci). The function is applied inside bacteriaAssessmentDecision() to calculate unique 90 day data windows, apply STV and geomean criteria where appropriate, and report the results of this analysis for summarization and language standardization by bacteriaAssessmentDecision(). To ease later data visualization steps, the function output includes a list-column named associatedData containing the raw data that fall within each 90 day window; the user can unpack it to see all data associated with each window.

# Function to assess on 90 day windows across input dataset
# really just a building block, one probably wouldn't run this function independently
bacteriaExceedances_NEW <- function(stationData, # input dataframe with bacteria data
                                    bacteriaField, # name of bacteria field data
                                    bacteriaRemark, # name of bacteria comment field
                                    sampleRequirement, # minimum n samples in 90 day window needed to apply geomean
                                    STV, # unique for ecoli/enter
                                    geomeanCriteria # unique for ecoli/enter
                                    ){
  # Output tibble to organize results, need list object to save associated data
  out <- tibble(`StationID` = as.character(NA),
                `Date Window Starts` = as.Date(NA), 
                `Date Window Ends` = as.Date(NA), 
                `Last Sample Date in Window` = as.Date(NA), 
                `Samples in 90 Day Window` = as.numeric(NA), 
                `STV Exceedances In Window` = as.numeric(NA), 
                `STV Exceedance Rate` = as.numeric(NA), 
                `STV Assessment` = as.character(NA),
                `Geomean In Window` = as.numeric(NA),
                `Geomean Assessment` = as.character(NA),
                associatedData = list())
  
  # Data reorg to enable both types of bacteria assessment from a single function
  stationData2 <- dplyr::select(stationData, FDT_STA_ID, FDT_DATE_TIME, !! bacteriaField, !! bacteriaRemark) %>%
    rename(Value = !! bacteriaField, LEVEL_Value = !! bacteriaRemark) %>%
    filter(! LEVEL_Value %in% c('Level II', 'Level I')) %>% # get lower levels out
    filter(!is.na(Value))
  
  if(nrow(stationData2) > 0){
    # Loop through each row of input df to test 90 day windows against assessment criteria
    for( i in 1 : nrow(stationData2)){
      time1 <- as.Date(stationData2$FDT_DATE_TIME[i])
      timePlus89 <- time1 + days(89) 
      
      # Organize prerequisites to decision process
      windowData <- filter(stationData2, as.Date(FDT_DATE_TIME) >= time1 & as.Date(FDT_DATE_TIME) <= timePlus89) %>% 
        mutate(nSamples = n(), # count number of samples in 90 day window
               STVhit = ifelse(Value > STV, TRUE, FALSE), # test values in window against STV
               geomean = ifelse(nSamples > 1, # calculate geomean of samples if nSamples>1
                                as.numeric(round::roundAll(EnvStats::geoMean(Value, na.rm = TRUE), digits=0, "r0.C")), # round to nearest whole number per Memo to Standardize Rounding for Assessment Guidance
                                NA), 
               geomeanCriteriaHit = ifelse(geomean > geomeanCriteria, TRUE, FALSE)) # test round to even geomean against geomean Criteria
      
      # First level of testing: any STV hits in dataset? Want this information for all scenarios
      nSTVhitsInWindow <- nrow(filter(windowData, STVhit == TRUE))
      # STV exceedance rate calculation with round to even math
      STVexceedanceRate <- ifelse(unique(windowData$nSamples) >= 10, 
                                  as.numeric(round::roundAll((nSTVhitsInWindow / unique(windowData$nSamples)) * 100,digits=0, "r0.C")), # round to nearest whole number per Memo to Standardize Rounding for Assessment Guidance
                                  NA) # no STV exceedance rate if < 10 samples
      if(nSTVhitsInWindow == 0){
        `STV Assessment` <- 'No STV violations within 90 day window' } 
      if(nSTVhitsInWindow == 1){
        `STV Assessment` <- paste(nSTVhitsInWindow, ' STV violation(s) with ', format(STVexceedanceRate, digits = 3), 
                                  '% exceedance rate in 90 day window | Insufficient Information (Prioritize for follow up monitoring)',sep='')}
      if(nSTVhitsInWindow >= 2){
        `STV Assessment` <- paste(nSTVhitsInWindow, ' STV violation(s) with ', format(STVexceedanceRate, digits = 3), 
                                  '% exceedance rate in 90 day window | Impaired: ', nSTVhitsInWindow,' hits in the same 90-day period',sep='') }
      
      
      # Second level of testing: only if minimum geomean sampling requirements met in 90 day period
      if(unique(windowData$nSamples) >= sampleRequirement){
        # Geomean Hit
        if(unique(windowData$geomeanCriteriaHit) == TRUE){
          `Geomean Assessment` <- paste('Geomean: ', format(unique(windowData$geomean), digits = 3), 
                                        ' | Impaired: geomean exceeds criteria in the 90-day period', sep='')  
        } else{
          `Geomean Assessment` <-  paste('Geomean: ', format(unique(windowData$geomean), digits = 3), 
                                         ' | Geomean criteria met, hold assessment decision for further testing', sep= '')} 
      } else { # minimum geomean sampling requirements NOT met in 90 day period
        `Geomean Assessment` <- 'Insufficient Information: geomean sampling criteria not met'  }
      
      out[i,] <-  tibble(`StationID` = unique(stationData2$FDT_STA_ID),
                         `Date Window Starts` = time1, 
                         `Date Window Ends` = timePlus89, 
                         `Last Sample Date in Window` = max(windowData$FDT_DATE_TIME),
                         `Samples in 90 Day Window` = unique(windowData$nSamples), 
                         `STV Exceedances In Window` = nSTVhitsInWindow, 
                         `STV Exceedance Rate` = STVexceedanceRate,
                         `STV Assessment` = `STV Assessment`,
                         `Geomean In Window` = ifelse(unique(windowData$nSamples) >= sampleRequirement, unique(windowData$geomean), NA), # only report a geomean result if sampleRequirement met
                         `Geomean Assessment` = `Geomean Assessment`,
                         associatedData = list(windowData)) 
    } #end for loop
  } else {
    out <- tibble(`StationID` = unique(stationData$FDT_STA_ID),
                  `Date Window Starts` = as.Date(NA), 
                  `Date Window Ends` = as.Date(NA), 
                  `Last Sample Date in Window` = as.Date(NA), 
                  `Samples in 90 Day Window` = as.numeric(NA), 
                  `STV Exceedances In Window` = as.numeric(NA), 
                  `STV Exceedance Rate` = as.numeric(NA), 
                  `STV Assessment` = as.character(NA),
                  `Geomean In Window` = as.numeric(NA),
                  `Geomean Assessment` = as.character(NA),
                  associatedData = list(NA))
  }
  
  # For IR2024, we only want to make assessment decisions on data windows with unique datasets. Running distinct() on `Last Sample Date in Window`
  # keeps only the rows with the first occurrence of any given sample date, allowing us to remove rolled windows that contain duplicated 
  # datasets. Joining this result back to the out dataset allows us to flag `Valid Assessment Window` appropriately. Since this function
  # feeds the app, we want to keep all records; only in later functions that make assessment decisions do we filter out 
  # non-unique windows.
  distinctWindows <- distinct(out, `Last Sample Date in Window`, .keep_all = T) %>% 
    mutate(`Valid Assessment Window` = TRUE)
  
  out <- left_join(out, distinctWindows,
                    by = c("StationID", "Date Window Starts", "Date Window Ends", "Last Sample Date in Window", 
                           "Samples in 90 Day Window", "STV Exceedances In Window", "STV Exceedance Rate", 
                           "STV Assessment", "Geomean In Window", "Geomean Assessment", "associatedData")) %>% # join by everything to be safe
    dplyr::select(StationID:`Last Sample Date in Window`, `Valid Assessment Window`, everything())
  
  return(out) 
}

# Example Usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
#bacteriaExceedances_NEW(stationData, 'ECOLI', 'LEVEL_ECOLI', 10, 410, 126)
#bacteriaExceedances_NEW(stationData, 'ENTEROCOCCI', 'LEVEL_ENTEROCOCCI', 10, 130, 35) 
# How to unpack raw data from inside each unique 90 day window
# y <- bacteriaExceedances_NEW(stationData, 'ECOLI', 'LEVEL_ECOLI', 10, 410, 126)
# View(y) # look at this data structure
# y$associatedData[1] # look at first 90 day window associated data
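The geomean math inside the window loop can be checked in isolation. EnvStats::geoMean() takes the nth root of the product of the values, and the round::roundAll() call mirrors the rounding applied in the function above. A standalone check with made-up values chosen so the geomean comes out exactly:

```r
library(EnvStats)
library(round)

# Geometric mean: (100 * 200 * 400)^(1/3) = 200 exactly for these made-up values
gm <- EnvStats::geoMean(c(100, 200, 400), na.rm = TRUE)
gm  # 200
# Round to the nearest whole number the same way the window loop does
as.numeric(round::roundAll(gm, digits = 0, "r0.C"))
# Compare against the E.coli geomean criterion of 126 used in the examples above
gm > 126  # TRUE -> a window with these samples would register a geomean hit
```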

3.5.5.2 bacteriaAssessmentDecision

The bacteriaAssessmentDecision() function is the workhorse of the bacteria assessment logic, summarizing bacteria assessment results into decisions. It passes the provided arguments through to bacteriaExceedances_NEW() to perform the bacteria criteria analysis, and is itself applied inside the bacteriaAssessmentDecisionClass() function to summarize and report the results of the bacteriaExceedances_NEW() analysis. To ease later data visualization steps, the function output includes a list-column named associatedDecisionData that the user can unpack to see all data associated with each 90 day window.

# Function to summarize bacteria assessment results into decisions
# This function returns all potential issues with priority on geomean results IF there
# are enough samples to run geomean
# Round to even rules are applied
bacteriaAssessmentDecision <- function(stationData, # input dataframe with bacteria data
                                       bacteriaField, # name of bacteria field data
                                       bacteriaRemark, # name of bacteria comment field
                                       sampleRequirement, # minimum n samples in 90 day window needed to apply geomean
                                       STV, # unique for ecoli/enter
                                       geomeanCriteria # unique for ecoli/enter
){
    # Rename output columns based on station table template
    stationTableName <- ifelse(bacteriaField == 'ECOLI', "ECOLI", "ENTER")
    
    nSamples <- select(stationData,  Value = {{ bacteriaField }} ) %>% 
      filter(!is.na(Value)) # total n samples taken in assessment window
    
    
    if(nrow(nSamples) > 0){ # only proceed through decisions if there is data to be analyzed

    # Run assessment function
    # make two objects here because we want to base all decisions only on valid data, but we also want to output all associated data from 
      # this function, so we create a rawAnalysisForOutput to be saved in a list column for unpacking later
    rawAnalysisForOutput <- suppressWarnings(bacteriaExceedances_NEW(stationData, bacteriaField, bacteriaRemark, sampleRequirement, STV, geomeanCriteria)   )
    validForAssessment <- rawAnalysisForOutput %>% 
      filter(`Valid Assessment Window` == TRUE)
    
    # bail out if no data to analyze bc all Level II or Level I, OR (new for IR2024) no valid windows for assessment
    if(nrow(filter(validForAssessment, !is.na(`Date Window Starts`))) == 0 ){
      return(tibble(StationID = unique(stationData$FDT_STA_ID),
                    `_EXC` = NA, # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                    #`_IMPAIREDWINDOWS` = NA,
                    `_SAMP` = NA, 
                    `_GM.EXC` = NA,
                    `_GM.SAMP` = NA,
                    `_STAT` = NA, # is this the right code???
                    `_STAT_VERBOSE` = NA, 
                    `BACTERIADECISION` = NA,
                    `BACTERIASTATS` = NA,
                    associatedDecisionData = list(NA)) %>%
               rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
               rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
    }
    
    # number of STV exceedances, reported in bacteria_EXC field in stations table and useful for logic testing
    # don't want to do this analysis on validForAssessment bc sum(validForAssessment$`STV Exceedances In Window`) will overshoot real n
    exceedSTVn <- select(stationData,  Value = {{ bacteriaField }} ) %>%
      filter(Value > STV) # total STV exceedances in dataset
    # Windows with > 10% STV rate, these can only be calculated on windows with 10 or more samples
    exceedSTVrate <- filter(validForAssessment, `STV Exceedance Rate` > 10)
    # windows with geomean exceedances, these can only be calculated on windows with 10 or more samples
    exceedGeomean <- filter(validForAssessment, `Geomean In Window` > geomeanCriteria)
    
    # Decision logic time, work through geomean first and if there is no appropriate geomean data (no windows with 10+ samples)
    #   then go to STV assessment
    
    # Were at least 10 samples taken within any 90-day period of the assessment window?
    if( any(!is.na(validForAssessment$`Geomean In Window`)) ){ # Were at least 10 samples taken within any 90-day period of the assessment window? - Yes
      # Do the geometric means calculated for the 90-day periods represented by 10+ samples meet the GM criterion?
      if( nrow(exceedGeomean) == 0){ # Do the geometric means calculated for the 90-day periods represented by 10+ samples meet the GM criterion? - Yes
        # Do any of the 90-day periods of the assessment window represented in the dataset exceed the 10% STV Exceedance Rate?
        if( nrow(exceedSTVn) > 0){ # Do any of the 90-day periods of the assessment window represented in the dataset exceed the 10% STV Exceedance Rate? - Yes
          
          # Yes, in a 90-day period represented by 10+ samples
          if(nrow(filter(exceedSTVrate, `Samples in 90 Day Window` >= 10 & `STV Exceedance Rate` > 10)) > 0){ # STV exceedances in a 90-day period represented by >= 10 samples
            return(tibble(StationID = unique(validForAssessment$StationID),
                          `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                          `_SAMP` = nrow(nSamples), 
                          `_GM.EXC` = nrow(exceedGeomean),
                          `_GM.SAMP` = nrow(filter(validForAssessment, !is.na(`Geomean In Window`))),
                          `_STAT` = "IM",
                          `_STAT_VERBOSE` = "Impaired - 2 or more STV exceedances in the same 90-day period represented by 10+ samples, no geomean exceedances.",#STV exceedances in a 90-day period represented by >= 10 samples after verifying geomean passes where applicable.",
                          `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                          `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                          associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                     rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
                     rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
            
          } else {  # STV exceedances in a 90-day period represented by < 10 samples
            
            # 2 or more hits in the same 90-day period?
            if(any(validForAssessment$`STV Exceedances In Window` >= 2) ){
              return(tibble(StationID = unique(validForAssessment$StationID),
                            `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                            `_SAMP` = nrow(nSamples), 
                            `_GM.EXC` = nrow(exceedGeomean),
                            `_GM.SAMP` = nrow(filter(validForAssessment, !is.na(`Geomean In Window`))),
                            `_STAT` = "IM",
                            `_STAT_VERBOSE` = "Impaired- 2 or more STV exceedances in the same 90-day period with < 10 samples, no geomean exceedances.", #2 or more STV hits in the same 90-day period with < 10 samples after verifying geomean passes where applicable.",
                            `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                            `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                            associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                       rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
                       rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
              } else { 
                
                # did the STV exceedance(s) occur in windows with 10+ samples?
                if(all(filter(validForAssessment, `STV Exceedances In Window` > 0)$`Samples in 90 Day Window` >= 10)){
                  return(tibble(StationID = unique(validForAssessment$StationID),
                                `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                                `_SAMP` = nrow(nSamples), 
                                `_GM.EXC` = nrow(exceedGeomean),
                                `_GM.SAMP` = nrow(filter(validForAssessment, !is.na(`Geomean In Window`))),
                                `_STAT` = "S",
                                `_STAT_VERBOSE` = "Fully Supporting - No STV exceedance rates >10% or geomean exceedances in any 90-day period represented by 10+ samples.",# No geomean exceedances and STV exceedance(s) in one or multiple 90-day periods represented by 10+ samples.", # previous language: 1 STV hit in one or multiple 90-day periods with < 10 samples after verifying geomean passes where applicable.",
                                `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                                `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                                associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                           rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
                           rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
                  
                 } else {# STV exceedance(s) occurred in windows with < 10 samples
                
                # 1 hit in one or multiple 90-day periods after verifying geomean passes where applicable
                return(tibble(StationID = unique(validForAssessment$StationID),
                              `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                              `_SAMP` = nrow(nSamples), 
                              `_GM.EXC` = nrow(exceedGeomean),
                              `_GM.SAMP` = nrow(filter(validForAssessment, !is.na(`Geomean In Window`))),
                              `_STAT` = "O",
                              `_STAT_VERBOSE` = "Fully Supporting - No geomean exceedances and only 1 STV exceedance in one or multiple 90-day periods represented by < 10 samples.", # previous language: 1 STV hit in one or multiple 90-day periods with < 10 samples after verifying geomean passes where applicable.",
                              `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                              `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                              associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                         rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
                         rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
                } 
                }
          }
          
        } else {  # Do any of the 90-day periods of the assessment window represented in the dataset exceed the 10% STV Exceedance Rate? - No
          return(tibble(StationID = unique(validForAssessment$StationID),
                        `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                        `_SAMP` = nrow(nSamples), 
                        `_GM.EXC` = nrow(exceedGeomean),
                        `_GM.SAMP` = nrow(filter(validForAssessment, !is.na(`Geomean In Window`))),
                        `_STAT` = "S",
                        `_STAT_VERBOSE` = "Fully Supporting - No STV exceedance rates >10% or geomean exceedances in any 90-day period represented by 10+ samples.", #No STV exceedances or geomean exceedances in any 90-day period.",
                        `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                        `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                        associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                   rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
                   rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
          }
        
      } else { # Do the geometric means calculated for the 90-day periods represented by 10+ samples meet the GM criterion? - No
        return(tibble(StationID = unique(validForAssessment$StationID),
                      `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                      `_SAMP` = nrow(nSamples), 
                      `_GM.EXC` = nrow(exceedGeomean),
                      `_GM.SAMP` = nrow(filter(validForAssessment, !is.na(`Geomean In Window`))),
                      `_STAT` = "IM",
                      `_STAT_VERBOSE` = "Impaired- geomean exceedance in any 90-day period.", #geomean exceedance(s) in any 90-day period with >= 10 samples.",
                      `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                      `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                      associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                 rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
                 rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
        }
      
    } else { # Were at least 10 samples taken within any 90-day period of the assessment window? - No
      # Were there any hits of the STV during the dataset?
      if( nrow(exceedSTVn) == 0){ # Were there any hits of the STV during the dataset? - No
        return(tibble(StationID = unique(validForAssessment$StationID),
                      `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                      `_SAMP` = nrow(nSamples), 
                      `_GM.EXC` = as.numeric(NA), #nrow(exceedGeomean), # Data Entry manual updated to require NA instead of 0 if < 10 samples per 90 day window
                      `_GM.SAMP` = as.numeric(NA), #nrow(filter(validForAssessment, !is.na(`Geomean In Window`))), # Data Entry manual updated to require NA instead of 0 if < 10 samples per 90 day window
                      `_STAT` = "IN", 
                      `_STAT_VERBOSE` = "Insufficient Information (Prioritize for follow up monitoring)- No STV exceedances but insufficient data to analyze geomean.", #0 STV hits but insufficient data to analyze geomean.",
                      `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                      `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                      associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                 rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE))%>%  # fix names to match station table format
                 rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
        } else { # Were there any hits of the STV during the dataset? - Yes
          # 2 or more hits in the same 90-day period
          if(any(validForAssessment$`STV Exceedances In Window` >= 2) ){
            return(tibble(StationID = unique(validForAssessment$StationID),
                          # not quite right yet
                          `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the number of STV exceedances in a 90-day period with 10+ samples
                          `_SAMP` = nrow(nSamples), 
                          `_GM.EXC` = as.numeric(NA), #nrow(exceedGeomean), # Data Entry manual updated to require NA instead of 0 if < 10 samples per 90 day window
                          `_GM.SAMP` = as.numeric(NA), #nrow(filter(validForAssessment, !is.na(`Geomean In Window`))), # Data Entry manual updated to require NA instead of 0 if < 10 samples per 90 day window
                          `_STAT` = "IM", 
                          `_STAT_VERBOSE` = "Impaired - 2 or more STV hits in the same 90-day period with < 10 samples.",
                          `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                          `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                          associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                     rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE))%>%  # fix names to match station table format
                     rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
          } else { 
            # 1 hit in one or multiple 90-day periods
            return(tibble(StationID = unique(validForAssessment$StationID),
                          `_EXC` = nrow(exceedSTVn), # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                          `_SAMP` = nrow(nSamples), 
                          `_GM.EXC` = as.numeric(NA), #nrow(exceedGeomean), # Data Entry manual updated to require NA instead of 0 if < 10 samples per 90 day window
                          `_GM.SAMP` = as.numeric(NA), #nrow(filter(validForAssessment, !is.na(`Geomean In Window`))), # Data Entry manual updated to require NA instead of 0 if < 10 samples per 90 day window
                          `_STAT` = "IN",
                          `_STAT_VERBOSE` = "Insufficient Information (Prioritize for follow up monitoring)- One STV exceedance in one or multiple 90-day periods but insufficient data to analyze geomean.",#1 STV hit in one or multiple 90-day periods but insufficient data to analyze geomean.",
                          `BACTERIADECISION` = paste0(stationTableName, ": ",`_STAT_VERBOSE`),
                          `BACTERIASTATS` = paste0(stationTableName, ": Number of 90 day windows with > 10% STV exceedance rate: ", nrow(exceedSTVrate)),
                          associatedDecisionData = list(rawAnalysisForOutput) ) %>%
                     rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
                     rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
            }
        }
      }
    # No bacteria data to analyze
    } else {
      return(tibble(StationID = unique(stationData$FDT_STA_ID),
                    `_EXC` = NA, # right now this is set to # total STV exceedances, not the # STV exceedances in a 90-day period with 10+ samples
                    `_SAMP` = NA, 
                    `_GM.EXC` = NA,
                    `_GM.SAMP` = NA,
                    `_STAT` = NA, # is this the right code???
                    `_STAT_VERBOSE` = NA, 
                    `BACTERIADECISION` = NA,
                    `BACTERIASTATS` = NA,
                    associatedDecisionData = list(NA)) %>%
               rename_with( ~ gsub("_", paste0(stationTableName,"_"), .x, fixed = TRUE)) %>%  # fix names to match station table format
               rename_with( ~ gsub(".", "_", .x, fixed = TRUE)) ) # special step to get around accidentally replacing _GM with station table name
    }
  }

# Example Usage:
# where stationData object is already in your environment, see Automated Assessment for environment set up help
# bacteriaAssessmentDecision(stationData, 'ECOLI', 'LEVEL_ECOLI', 10, 410, 126)
# bacteriaAssessmentDecision(stationData, 'ENTEROCOCCI', 'LEVEL_ENTEROCOCCI', 10, 130, 35)
# To get just info for station table  
#bacteriaAssessmentDecision(stationData, 'ECOLI', 'LEVEL_ECOLI', 10, 410, 126) %>%
#  dplyr::select(StationID:ECOLI_STAT)
#bacteriaAssessmentDecision(stationData, 'ENTEROCOCCI', 'LEVEL_ENTEROCOCCI', 10, 130, 35) %>%
#  dplyr::select(StationID:ENTER_STAT)

# How to unpack analyzed data from a single site
# x <- bacteriaAssessmentDecision(stationData, 'ENTEROCOCCI', 'LEVEL_ENTEROCOCCI', 10, 130, 35)
# View(x) # look at this data structure
# x$associatedDecisionData[1] # look at each 90 day window used to make the assessment decision

3.5.5.3 bacteriaAssessmentDecisionClass

bacteriaAssessmentDecisionClass() is the outermost function used to assess bacteria data. The function takes in all data for a single station as well as a StationID (to provide adequate output information in case no bacteria data exist for that station). Based on the WQS class, E. coli is analyzed if the site is freshwater, while both E. coli and Enterococci are analyzed if the site is saltwater or transitional. It is up to the regional assessor to determine which bacteria analysis best represents transitional sites. Regardless of waterbody type, all Level I and Level II data and missing data are removed before measures are rounded to even and compared against the provided criteria. The output of this function includes all the bacteria columns required in the Station Table for CEDS bulk data upload, as well as a verbose comment provided to the assessor for standardized inclusion in CEDS WQA.

## outermost function to decide which bacteria should be assessed based on WQS Class
bacteriaAssessmentDecisionClass <- function(stationData, # input dataframe with bacteria data
                                            uniqueStationName # stationID for output in case no data come into function
                                            ){

  
  # stop everything if no data to analyze in stationData
  if(nrow(stationData) == 0){
    return(
      tibble(StationID = uniqueStationName, ECOLI_EXC = as.numeric(NA), ECOLI_SAMP = as.numeric(NA), ECOLI_GM_EXC = as.numeric(NA), ECOLI_GM_SAMP = as.numeric(NA),
             ECOLI_STAT = as.character(NA), ECOLI_STATECOLI_VERBOSE = as.character(NA),
             ENTER_EXC = as.numeric(NA), ENTER_SAMP = as.numeric(NA), ENTER_GM_EXC = as.numeric(NA), ENTER_GM_SAMP = as.numeric(NA),
             ENTER_STAT = as.character(NA), ENTER_STATENTER_VERBOSE = as.character(NA)) )}
  
  # lake stations should only be surface sample
  if(unique(stationData$lakeStation) == TRUE){
    stationData <- filter(stationData, FDT_DEPTH <= 0.3) }
  
  if(nrow(stationData) > 0){
    if(unique(stationData$CLASS) %in% c('I', 'II')){
            # previously this was programmed to only output enterococci results for these classes, but assessors requested both outputs to be conservative
      return(
        left_join(bacteriaAssessmentDecision(stationData, 'ECOLI', 'LEVEL_ECOLI', 10, 410, 126),
                  bacteriaAssessmentDecision(stationData, 'ENTEROCOCCI', 'LEVEL_ENTEROCOCCI', 10, 130, 35),
                  by = 'StationID') %>% 
          mutate(BACTERIADECISION = paste0(BACTERIADECISION.x, ' | ', BACTERIADECISION.y),
                 BACTERIASTATS = paste0(BACTERIASTATS.x, ' | ', BACTERIASTATS.y)) %>% 
          dplyr::select(-c(BACTERIADECISION.x, BACTERIADECISION.y, BACTERIASTATS.x, BACTERIASTATS.y,
                           associatedDecisionData.x, associatedDecisionData.y)) )
      } else {
      return(
        bacteriaAssessmentDecision(stationData, 'ECOLI', 'LEVEL_ECOLI', 10, 410, 126) %>%
          dplyr::select(StationID:BACTERIASTATS) %>% #ECOLI_STATECOLI_VERBOSE) %>%
          mutate(ENTER_EXC = as.numeric(NA), ENTER_SAMP = as.numeric(NA), ENTER_GM_EXC = as.numeric(NA), ENTER_GM_SAMP = as.numeric(NA),
                 ENTER_STAT = as.character(NA), ENTER_STATENTER_VERBOSE = as.character(NA)) ) }
  } else {
    return(
      tibble(StationID = uniqueStationName, ECOLI_EXC = as.numeric(NA), ECOLI_SAMP = as.numeric(NA), ECOLI_GM_EXC = as.numeric(NA), ECOLI_GM_SAMP = as.numeric(NA),
             ECOLI_STAT = as.character(NA), ECOLI_STATECOLI_VERBOSE = as.character(NA),
             ENTER_EXC = as.numeric(NA), ENTER_SAMP = as.numeric(NA), ENTER_GM_EXC = as.numeric(NA), ENTER_GM_SAMP = as.numeric(NA),
             ENTER_STAT = as.character(NA), ENTER_STATENTER_VERBOSE = as.character(NA)) )}
}

# Example Usage:
# where stationData object is already in your environment, see Automated Assessment for environment set up help
# bacteriaAssessmentDecisionClass(stationData)

3.5.6 Nutrients

3.5.7 Ammonia

Ammonia is evaluated in free flowing streams using acute, chronic, and four-day criteria across rolling three-year windows; each criterion may not be exceeded more than once every three years on average. The presence or absence of trout, freshwater mussel species, and early life stages of fish during most times of the year each affects the calculation of ammonia criteria for a given waterbody. The WQS state that “the Department of Environmental Quality, after consultation with the Virginia Department of Wildlife Resources and the U.S. Fish and Wildlife Service, has determined that the majority of Virginia freshwaters are likely to contain, or have contained in the past, freshwater mussel species in the family Unionidae and contain early life stages of fish during most times of the year. Therefore, the ammonia criteria presented in subsections B and C of this section are designed to provide protection to these species and life stages.” In practice, this means that, unless demonstrated otherwise, mussels and fish early life stages are assumed present for criteria method calculations. Once the method for criteria calculation has been established for a waterbody, individual temperature and pH measurements are required to calculate individual criteria, and averages of these parameters are required when more than one measure is present within a given criteria window (the 30 day chronic and four day windows).
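
As a concrete illustration (a minimal standalone sketch, not part of the assessment scripts), a single pH and temperature pair determines the acute criterion under the trout-absent, mussels-present scenario using the 9VAC25-260-155 formula implemented in freshwaterNH3limit():

```r
# Acute ammonia criterion (mg N/L), trout absent & mussels present scenario,
# per 9VAC25-260-155; mirrors the formula used inside freshwaterNH3limit()
acuteNH3criterion <- function(pH, tempC){
  signif(0.7249 * ((0.0114 / (1 + 10^(7.204 - pH))) +
                     (1.6181 / (1 + 10^(pH - 7.204)))) *
           min(51.93, 23.12 * 10^(0.036 * (20 - tempC))),
         digits = 2) # two significant figures per the WQS
}

acuteNH3criterion(pH = 7.5, tempC = 20) # 9.2
```

Note how the criterion itself shifts with every temperature and pH measurement, which is why window averages of those parameters must be computed before any comparison to the ammonia average.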

Ammonia is a complicated criterion to program and understand. The automated assessment scripts organize ammonia criteria calculation into nested functions such that individual components of the analysis can be extracted when needed (e.g. during interactive application data visualization). To expedite application rendering time, calculated ammonia results are stored in an R dataset (.RDS) and provided to the application instead of recalculating the criteria on the fly. As such, the WQS (including trout present/absent) attributed to a station in the metadata attribution process, along with mussels and early life stages of fish assumed present, are the default values when exploring ammonia criteria analyses in the interactive waterbody-specific shiny assessment applications. Should sufficient justification exist to depart from these default values for individual stations, please contact the WQA Technical Support Team.
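
The caching pattern described above can be sketched as follows; the object and file location here are placeholders, not the actual names used by the assessment scripts:

```r
# minimal sketch of the .RDS caching pattern; ammoniaAnalysis and the file
# location are hypothetical stand-ins for the real precalculated results
ammoniaAnalysis <- data.frame(StationID = 'EXAMPLE-STA', Exceedance = 0)

cacheFile <- tempfile(fileext = '.RDS')
saveRDS(ammoniaAnalysis, cacheFile)      # run once by the assessment scripts
cachedResults <- readRDS(cacheFile)      # read quickly when the app launches

identical(cachedResults, ammoniaAnalysis) # TRUE
```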

3.5.7.1 freshwaterNH3limit

The calculation and application of ammonia criteria is performed by the freshwaterNH3limit() function. This function takes in data from one station (stationData argument) as well as three additional boolean (TRUE/FALSE) arguments: trout (trout present/absent), mussels (mussels present/absent), and earlyLife (early life stages of fish present/absent). The trout argument is determined by the associated WQS Class, where Class V and VI waters indicate trout are present and any other class indicates trout are absent. As stated above, the mussels and earlyLife arguments default to TRUE to be most protective; should justification exist, these options may be toggled to FALSE, resulting in different criteria methods being applied to the data.

This function is complicated and will be explained using the order of operations presented internally. Running through each scenario individually can aid understanding of the logic. First, because this is a computationally intensive function, if no data are available to analyze, the function returns NULL. If data are present, the function removes all Level I and Level II data, missing data, and data from greater than 1 meter depth, then rechecks whether any data remain for analysis, again returning NULL if not.

The function then creates storage objects for acute, chronic, and four day average results and loops through each row of data to correctly calculate acute criteria and find any chronic scenarios (and then run a four day analysis if data exist). Each new sample event prompts the creation of acute, chronic, and four day data windows. If sufficient data exist in any of these windows, the window temperature and pH averages are used to calculate ammonia criteria based on the scenario arguments established above for comparison to the window ammonia average, rounded to even. Chronic and four day windows are only calculated if more than one result exists in those data windows. The data analyzed within each window are saved as a listcolumn called associatedData that may be extracted for investigation later. The output of the function is a long dataset where each unique sample event may have a row summarizing the acute, chronic, and four day calculated criteria, sample count in the given window, exceedance result, and associated data (associatedData listcolumn).

# Calculate limits and return dataframe with original data and limits 9VAC25-260-155 https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section155/
freshwaterNH3limit <- function(stationData, # dataframe with station data
                               trout, # T/F condition
                               mussels,# T/F condition
                               earlyLife# T/F condition
){
  # If no data, return nothing
  if(nrow(stationData)==0){return(NULL)}
  
  # remove any data that shouldn't be considered
  stationDataAmmonia <- filter(stationData, !(LEVEL_FDT_TEMP_CELCIUS %in% c('Level II', 'Level I')) &
                                 !(LEVEL_FDT_FIELD_PH %in% c('Level II', 'Level I'))) %>% # get lower levels out of both temperature and pH
    # # lake stations should only be surface sample
    # {if(unique(stationData$lakeStation) == TRUE)
      filter(., FDT_DEPTH <= 1) %>%  # all samples should be < 1m (this allows for 0.3m surface samples and 0m integrated samples)
      # else . } %>%
    filter(!is.na(AMMONIA_mg_L)) %>% #get rid of NA's
    dplyr::select(FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, FDT_TEMP_CELCIUS, FDT_FIELD_PH, AMMONIA_mg_L) %>% 
    # can only run analysis if all above variable are populated for a given date/time
    filter(!is.na(FDT_TEMP_CELCIUS) & !is.na(FDT_FIELD_PH))
  # If no data, return nothing
  if(nrow(stationDataAmmonia)==0){return(NULL)}
  
  
  # make a place to store analysis results
  acuteCriteriaResults <- tibble(FDT_STA_ID = as.character(NA), WindowDateTimeStart = as.POSIXct(NA), 
                                 FDT_DEPTH = as.character(NA),# character so we can concatenate multiple depth values if needed
                                 Value = as.numeric(NA), ValueType = as.character(NA), 
                                 `Criteria Type` = as.character(NA), CriteriaValue = as.numeric(NA), 
                                 `Sample Count` = as.numeric(NA), 
                                 parameterRound = as.numeric(NA), Exceedance = as.numeric(NA),
                                 associatedData = list())
  chronicCriteriaResults <- acuteCriteriaResults
  fourDayCriteriaResults <- acuteCriteriaResults
  
  # loop through each row of data to correctly calculate acute criteria and find any chronic scenarios (and then 
  #   run a 4 day analysis if data exist)
  for(k in stationDataAmmonia$FDT_DATE_TIME){  #k = stationDataAmmonia$FDT_DATE_TIME[2]
    acuteDataWindow <- filter(stationDataAmmonia,  between(FDT_DATE_TIME, k, k + hours(1)))
    chronicDataWindow <- filter(stationDataAmmonia,  between(FDT_DATE_TIME, k, k + days(30)))
    fourDayDataWindow <- filter(stationDataAmmonia,  between(FDT_DATE_TIME, k, k + days(4)))
    
    
    # Run acute analysis if data exists
    # Acute is calculated on 1 hour windows, so we need to average temperature and pH within each 1 hour window before we can calculate an
    #  acute criteria. There are no rules on how many samples need to occur in a 1 hour window for the window to be valid, but chronic and
    #  4 day criteria require > 1 sample to be analyzed.
    
    # Acute criteria can combine multiple depths in calculation of criteria, <1m difference (which is taken care of above)
    
    if(nrow(acuteDataWindow) > 0){
      acuteDataCriteriaAnalysis <- suppressMessages( 
        acuteDataWindow %>% 
          #  group_by(FDT_STA_ID) %>% # for some reason if multiple depths but 1 stationID get 2 rows when group_by(FDT_STA_ID)
          summarise(FDT_STA_ID = paste0(unique(FDT_STA_ID), collapse = ' '), ### this is the ugly fix to problem identified above
                    FDT_DEPTH = paste0(sort(unique(FDT_DEPTH)), collapse = ' | '),
                    TempValue = mean(FDT_TEMP_CELCIUS, na.rm=T),  # get hourly average; don't round to even bc more calculations to follow with data
                    pHValue = mean(FDT_FIELD_PH, na.rm=T),  # get hourly average; don't round to even bc more calculations to follow with data
                    Value = mean(AMMONIA_mg_L, na.rm=T),  # get hourly average; don't round to even bc more calculations to follow with data
                    `Sample Count` = length(AMMONIA_mg_L)) %>%  #count sample that made up average
          mutate(ValueType = 'Hourly Average',
                 ID = paste( FDT_STA_ID, FDT_DEPTH, sep = '_'), # make a uniqueID in case >1 sample for given datetime
                 `Criteria Type` = 'Acute') %>% 
          {if(trout == TRUE & mussels == TRUE)  # Trout & mussels present scenario
            mutate(., CriteriaValue = as.numeric(signif(
              min(((0.275 / (1 + 10^(7.204 - pHValue))) + (39.0 / (1 + 10^(pHValue - 7.204)))),
                  (0.7249 * ( (0.0114 / (1 + 10^(7.204 - pHValue))) + (1.6181 / (1 + 10^(pHValue - 7.204)))) * (23.12 * 10^(0.036 * (20 - TempValue))) )), digits = 2)))
            else . } %>%
          {if(trout == TRUE & mussels == FALSE) # Trout present & mussels absent scenario
            mutate(., CriteriaValue = as.numeric(signif(
              min(((0.275 / (1 + 10^(7.204 - pHValue))) + (39.0 / (1 + 10^(pHValue - 7.204)))),
                  (0.7249 * ( (0.0114 / (1 + 10^(7.204 - pHValue))) + (1.6181 / (1 + 10^(pHValue - 7.204)))) * (62.15 * 10^(0.036 * (20 - TempValue))) )), digits = 2)))
            else . } %>%
          {if(trout == FALSE & mussels == TRUE) # Trout absent & mussels present scenario
            mutate(., CriteriaValue = as.numeric(signif(
              0.7249 * ((0.0114 / (1 + 10^(7.204 - pHValue))) + (1.6181 / (1 + 10^(pHValue - 7.204)))) * min(51.93, (23.12 * 10^(0.036 * (20 - TempValue)))), digits = 2)))
            else .} %>% 
          {if(trout == FALSE & mussels == FALSE) # Trout & mussels absent scenario
            mutate(., CriteriaValue = as.numeric(signif(
              0.7249 * ((0.0114 / (1 + 10^(7.204 - pHValue))) + (1.6181 / (1 + 10^(pHValue - 7.204)))) * min(51.93, (62.15 * 10^(0.036 * (20 - TempValue)))), digits = 2)))
            else .} %>% 
          mutate(parameterRound = signif(Value, digits = 2), # two significant figures based on 9VAC25-260-155 https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section155/
                 Exceedance = ifelse(parameterRound > CriteriaValue, 1, 0 ), # use 1/0 to easily summarize multiple results later
                 WindowDateTimeStart = min(acuteDataWindow$FDT_DATE_TIME)) %>% 
          dplyr::select(FDT_STA_ID, WindowDateTimeStart, everything(), -ID) ) 
      # now save data for later, in a format that matches with desired output
      acuteDataCriteriaAnalysis <- acuteDataCriteriaAnalysis %>% 
        dplyr::select(-c(TempValue, pHValue)) %>% 
        bind_cols(tibble(associatedData = list(left_join(acuteDataWindow,
                                                         dplyr::select(acuteDataCriteriaAnalysis, FDT_STA_ID,  WindowDateTimeStart, TempValue, pHValue, Value),
                                                         by = c('FDT_STA_ID', 'FDT_DATE_TIME' = 'WindowDateTimeStart')))) )
      
      # Save the results for viewing later
      acuteCriteriaResults <- bind_rows(acuteCriteriaResults, acuteDataCriteriaAnalysis) 
      
    } else {acuteCriteriaResults <- acuteCriteriaResults }
    

    # Run chronic analysis if enough data exists, must have > 1 sample in chronic window to run analysis
    # Chronic is calculated on 30 day windows, so we need to average temperature and pH within each 30 day window before we can calculate a
    #  chronic criteria. The chronic criteria will be associated with each sample date that starts a 30 day period, but it applies to all
    #  samples within the 30 day window. All raw data associated with each window is saved as a listcolumn for later review.
    
    # Chronic criteria can combine multiple depths in calculation of criteria, <1m difference (which is taken care of above)
    
    if(nrow(chronicDataWindow) > 1){ # need 2 or more data points to run a chronic
      chronicDataCriteriaAnalysis <- suppressMessages(
        chronicDataWindow %>% 
        #  group_by(FDT_STA_ID) %>% # for some reason if multiple depths but 1 stationID get 2 rows when group_by(FDT_STA_ID)
          summarise(FDT_STA_ID = paste0(unique(FDT_STA_ID), collapse = ' '), ### this is the ugly fix to problem identified above
                    FDT_DEPTH = paste0(sort(unique(FDT_DEPTH)), collapse = ' | '),
                    TempValue = mean(FDT_TEMP_CELCIUS, na.rm = T), # get 30 day average; don't round to even bc more calculations to follow with data
                    pHValue = mean(FDT_FIELD_PH, na.rm = T), # get 30 day average; don't round to even bc more calculations to follow with data
                    Value = mean(AMMONIA_mg_L, na.rm = T), # get 30 day average; don't round to even bc more calculations to follow with data
                    `Sample Count` = length(AMMONIA_mg_L)) %>%  #count sample that made up average
          mutate(ValueType = '30 Day Average',
                 ID = paste( FDT_STA_ID, FDT_DEPTH, sep = '_'), # make a uniqueID in case >1 sample for given datetime
                 `Criteria Type` = 'Chronic') %>%
        # Trout & mussels present scenario
        {if(trout == TRUE & mussels == TRUE & earlyLife == TRUE)  # Trout & mussels present scenario & earlyLife == TRUE
          mutate(., CriteriaValue = as.numeric(signif(
            0.8876 * ((0.0278 / (1 + 10^(7.688 - pHValue))) + (1.1994 / (1 + 10^(pHValue - 7.688)))) * (2.126 * 10^(0.028 * (20 - max(7, TempValue)))), digits = 2)))
          else .} %>%
        {if(trout == TRUE & mussels == TRUE & earlyLife == FALSE)  # Trout & mussels present scenario & earlyLife == FALSE
          mutate(., CriteriaValue = as.numeric(NA))
          else .} %>%

        # Trout present & mussels absent scenario
        {if(trout == TRUE & mussels == FALSE & earlyLife == TRUE)  # Trout present & mussels absent scenario & earlyLife == TRUE
          mutate(., CriteriaValue = as.numeric(signif(
            0.9405 * ((0.0278 / (1 + 10^(7.688 - pHValue))) + (1.1994 / (1 + 10^(pHValue - 7.688)))) * min(6.92, (7.547 * 10^(0.028 * (20 - TempValue)))), digits = 2)))
          else .} %>%
        {if(trout == TRUE & mussels == FALSE & earlyLife == FALSE)  # Trout present & mussels absent scenario & earlyLife == FALSE
          mutate(., CriteriaValue = as.numeric(signif(
            0.9405 * ((0.0278 / (1 + 10^(7.688 - pHValue))) + (1.1994 / (1 + 10^(pHValue - 7.688)))) * (7.547 * 10^(0.028 * (20 - max(TempValue, 7)))), digits = 2)))
          else .} %>%

        # Trout absent & mussels present scenario
        {if(trout == FALSE & mussels == TRUE & earlyLife == TRUE)  # Trout absent & mussels present scenario & earlyLife == TRUE
          mutate(., CriteriaValue = as.numeric(signif(
            0.8876 * ((0.0278 / (1 + 10^(7.688 - pHValue))) + (1.1994 / (1 + 10^(pHValue - 7.688)))) * (2.126 * 10^(0.028 * (20 - max(7, TempValue)))), digits = 2)))
          else .} %>%
        {if(trout == FALSE & mussels == TRUE & earlyLife == FALSE)  # Trout absent & mussels present scenario & earlyLife == FALSE
          mutate(., CriteriaValue = as.numeric(NA))
          else .} %>%

         # Trout & mussels absent scenario
        {if(trout == FALSE & mussels == FALSE & earlyLife == TRUE)  # Trout & mussels absent scenario & earlyLife == TRUE
          mutate(., CriteriaValue = as.numeric(signif(
            0.9405 * ((0.0278 / (1 + 10^(7.688 - pHValue))) + (1.1994 / (1 + 10^(pHValue - 7.688)))) * min(6.92, (7.547 * 10^(0.028 * (20 - TempValue)))), digits = 2)))
          else .} %>%
        {if(trout == FALSE & mussels == FALSE & earlyLife == FALSE)  # Trout & mussels absent scenario & earlyLife == FALSE
          mutate(., CriteriaValue = as.numeric(signif(
            0.9405 * ((0.0278 / (1 + 10^(7.688 - pHValue))) + (1.1994 / (1 + 10^(pHValue - 7.688)))) * (7.547 * 10^(0.028 * (20 - max(TempValue, 7)))), digits = 2)))
          else .} %>%

        mutate(parameterRound = signif(Value, digits = 2), # two significant figures based on 9VAC25-260-155 https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section155/
               Exceedance = ifelse(parameterRound > CriteriaValue, 1, 0 ), # use 1/0 to easily summarize multiple results later
               WindowDateTimeStart = min(chronicDataWindow$FDT_DATE_TIME)) %>%
        dplyr::select(FDT_STA_ID, WindowDateTimeStart, everything(), -ID) ) 
      # now save data for later, in a format that matches with desired output
      chronicDataCriteriaAnalysis <- chronicDataCriteriaAnalysis %>% 
        dplyr::select(-c(TempValue, pHValue)) %>% 
        bind_cols(tibble(associatedData = list(left_join(chronicDataWindow,
                                                         dplyr::select(chronicDataCriteriaAnalysis, FDT_STA_ID,  WindowDateTimeStart, TempValue, pHValue, Value),
                                                         by = c('FDT_STA_ID', 'FDT_DATE_TIME' = 'WindowDateTimeStart')))) )

      # Save the results for viewing later
      chronicCriteriaResults <- bind_rows(chronicCriteriaResults, chronicDataCriteriaAnalysis)
    } else {chronicCriteriaResults <- chronicCriteriaResults }
    
    
    # Run 4 day analysis if data exists
    # 4 day analysis is run on 4 day windows, so we need to average temperature and pH within each 4 day window before we can compare to a 2.5x 
    #  chronic criteria. The chronic criteria associated with each sample date that starts a 4 day period is used for the 4 day analysis. 
    #  All raw data associated with each window is saved as a listcolumn for later review.
    
    # 4day criteria can combine multiple depths in calculation of criteria, <1m difference (which is taken care of above)
    
    if(nrow(fourDayDataWindow) > 1){
      fourDayDataCriteriaAnalysis <- suppressMessages( 
        fourDayDataWindow %>% 
          #  group_by(FDT_STA_ID) %>% # for some reason if multiple depths but 1 stationID get 2 rows when group_by(FDT_STA_ID)
          summarise(FDT_STA_ID = paste0(unique(FDT_STA_ID), collapse = ' '), ### this is the ugly fix to problem identified above
                    FDT_DEPTH = paste0(sort(unique(FDT_DEPTH)), collapse = ' | '),
                    TempValue = mean(FDT_TEMP_CELCIUS, na.rm=T),  # get 4day average; doesn't do anything bc this doesn't go into criteria calculation but needed to match other data format
                    pHValue = mean(FDT_FIELD_PH, na.rm=T),  # get 4day average; doesn't do anything bc this doesn't go into criteria calculation but needed to match other data format
                    Value = mean(AMMONIA_mg_L, na.rm=T),  # get 4day average; don't round to even bc want to see raw and parameterRound columns in output
                    `Sample Count` = length(AMMONIA_mg_L)) %>%  #count sample that made up average
          mutate(ValueType = 'Four Day Average',
                 ID = paste( FDT_STA_ID, FDT_DEPTH, sep = '_'), # make a uniqueID in case >1 sample for given datetime
                 `Criteria Type` = 'Four Day') %>% 
          left_join(dplyr::select(chronicDataCriteriaAnalysis, FDT_STA_ID, `Chronic CriteriaValue` = CriteriaValue),
                    by = c('FDT_STA_ID')) %>% 
          mutate(CriteriaValue = as.numeric(signif(`Chronic CriteriaValue` * 2.5, digits = 2)), # 4 day criteria = 2.5x chronic criteria
                 parameterRound = signif(Value, digits = 2), # two significant figures based on 9VAC25-260-155 https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section155/
                 Exceedance = ifelse(parameterRound > CriteriaValue, 1, 0 ), # use 1/0 to easily summarize multiple results later
                 WindowDateTimeStart = min(fourDayDataWindow$FDT_DATE_TIME)) %>% 
          dplyr::select(FDT_STA_ID, WindowDateTimeStart, everything(), -c(ID, `Chronic CriteriaValue` ) ) )
      # now save data for later, in a format that matches with desired output
      fourDayDataCriteriaAnalysis <- fourDayDataCriteriaAnalysis %>% 
        dplyr::select(-c(TempValue, pHValue)) %>% 
        bind_cols(tibble(associatedData = list(left_join(fourDayDataWindow,
                                                         dplyr::select(fourDayDataCriteriaAnalysis, FDT_STA_ID,  WindowDateTimeStart, TempValue, pHValue, Value),
                                                         by = c('FDT_STA_ID', 'FDT_DATE_TIME' = 'WindowDateTimeStart')))) )
      
      # Save the results for viewing later
      fourDayCriteriaResults <- bind_rows(fourDayCriteriaResults, fourDayDataCriteriaAnalysis) 
      
    } else {fourDayCriteriaResults <- fourDayCriteriaResults }
    
          

    combinedResults <- bind_rows(acuteCriteriaResults, chronicCriteriaResults) %>%
      {if(nrow(fourDayCriteriaResults) > 0)
        bind_rows(., fourDayCriteriaResults)
        else .} %>% 
      arrange(WindowDateTimeStart, `Criteria Type`)
        
    }
  
      
      return( combinedResults)#bind_rows(acuteDataCriteriaAnalysis, chronicDataCriteriaAnalysis))
}

# Example Usage:
# where stationData objects are already in your environment, see Automated Assessment for environment set up help
# ammoniaAnalysisStation <- freshwaterNH3limit(stationData, 
#                                              trout = ifelse(unique(stationData$CLASS) %in% c('V','VI'), TRUE, FALSE),
#                                              mussels = TRUE, 
#                                              earlyLife = TRUE) 
# https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section155/ states the assumption is that
# waters are to be assessed with the assumption that mussels and early life stages of fish should be present
# trout presence is determined by WQS class, this can be changed in the app but is forced to be what the station
# is attributed to in the automated assessment scripts
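
To summarize exceedances from the long output, one might tally the Exceedance flag by criteria type; the toy tibble below is a hypothetical stand-in for actual freshwaterNH3limit() results:

```r
library(dplyr)

# toy data mimicking two columns of freshwaterNH3limit() output
toyResults <- tibble(`Criteria Type` = c('Acute', 'Acute', 'Chronic', 'Four Day'),
                     Exceedance = c(0, 1, 1, 0))

# count windows and exceedances by criteria type
exceedanceSummary <- toyResults %>%
  group_by(`Criteria Type`) %>%
  summarise(`Window Count` = n(),
            `Exceedance Count` = sum(Exceedance))
```

The 1/0 coding of the Exceedance column is what makes this kind of sum-based summary possible downstream.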

3.5.7.2 annualRollingExceedanceAnalysis

Calculating the appropriate ammonia criteria is just the first part of assessing ammonia. To assess ammonia in free-flowing and tidal waters for Wildlife and Aquatic Life Uses, acute, chronic, and four day results must not exceed their respective criteria more than once every three years on average. Helper functions have been established to assist with the automation of these decisions. The annualRollingExceedanceAnalysis() helper function takes in the output of the freshwaterNH3limit() function and an additional argument (yearsToRoll) that establishes the number of years to roll analyses over; for our use case, three years. The argument aquaticLifeUse allows this function to be used across multiple data input types. The default (FALSE) uses the logic from the ammonia freshwaterNH3limit() function to flag whether or not a window is valid for later assessment steps. One would change aquaticLifeUse from FALSE to TRUE when, for example, assessing freshwater chloride for the Aquatic Life Designated Use. When aquaticLifeUse is set to TRUE, the function uses the Aquatic Life Use definition of valid windows to flag windows for later assessment steps. The Assessment Guidance states: “For toxic pollutant assessment of Aquatic Life Designated Use in free-flowing streams, both chronic and acute criteria can be assessed whenever sufficient data are available as applicable. Chronic criteria are to be assessed when multiple grab samples are collected within two separate four-day periods within a three-year period, or when there are two or more separate 30-day SPMD deployments within a three-year period. Two samples (either grab or SPMD) taken within three consecutive years are sufficient to assess acute criteria.”

The function extracts the year from each data window analyzed and drops any criteria results that do not meet sample count rules (e.g. chronic or four day window results calculated on 1 or fewer samples). Based on the unique years of the input data windows and the yearsToRoll argument (3 in our case), the function loops over data within each yearsToRoll window range to calculate the number of exceedances in the rolled window (and whether or not the window is valid) by criteria type (e.g. acute, chronic, or four day) and appends all data used for the summary in a listcolumn called associatedData to simplify data review. Window ranges are not repeated thanks to the logic used to establish the windowRange object that the for loop iterates over.
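
The windowRange logic can be demonstrated with a few hypothetical sample years. With data in 2015, 2016, 2018, and 2020 and yearsToRoll = 3, window starts are capped at 2018 (so no three-year window extends past 2020) and 2017 is dropped because no data begin there:

```r
# same window-start logic as annualRollingExceedanceAnalysis(), with
# hypothetical sample years standing in for year(WindowDateTimeStart)
windowOptions <- c(2015, 2016, 2018, 2020)
yearsToRoll <- 3

windowRange <- (min(windowOptions):(max(windowOptions) - (yearsToRoll - 1)))[
  (min(windowOptions):(max(windowOptions) - (yearsToRoll - 1))) %in% windowOptions]

windowRange # 2015 2016 2018
```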

# New rolling window analysis function. This calculates the number of exceedances in a X year window (not repeating windows)
#  and determines if the number of windows exceeding criteria are greater than the number of windows not exceeding criteria
#  using subsequent annualRollingExceedanceSummary()
# analyze results by 3 year windows, not a strict rolling window, roll by year
annualRollingExceedanceAnalysis <- function(dataToAnalyze,
                                            yearsToRoll, # number of years to roll analysis over
                                            aquaticLifeUse = FALSE # whether or not this function is being used for Aquatic Life Use criteria
                                            # default = FALSE, which uses the Ammonia definition for window validity for chronic and four day sampling criteria
                                            # if TRUE, two samples across the rolled window length ensures a valid window
                                            ){
  
  if(is.null(dataToAnalyze)){return(NULL)}
  
  # place to store results
  dataWindowResults <- tibble(FDT_STA_ID = as.character(NA), `Window Begin` = as.numeric(NA),
                              FDT_DEPTH = as.character(NA),# character so we can concatenate multiple depth values if needed
                              `Criteria Type` = as.character(NA),
                              `Years Analysis Rolled Over`= as.numeric(NA), 
                              `Exceedances in Rolled Window` = as.numeric(NA), 
                              associatedData = list())
  dataToAnalyze <- dataToAnalyze %>% 
    mutate(Year = year(WindowDateTimeStart),
           `Valid Window` = case_when(`Criteria Type` %in% c("Acute") ~ TRUE, # acute windows are always valid
                                      `Criteria Type` %in% c("Chronic", "Four Day") & `Sample Count` > 1 ~ TRUE,
                                      TRUE ~ NA))
  
  # First, identify all window start options
  windowOptions <- unique(year(dataToAnalyze$WindowDateTimeStart))
  # Now, stop loop from repeating windows by capping the top end by yearsToRoll-1 (so don't have 3 yr windows exceeding assessment period), then remove any
  #  windowRange options that don't actually have data in those years
  candidateStarts <- min(windowOptions):(max(windowOptions) - (yearsToRoll - 1))
  windowRange <- candidateStarts[candidateStarts %in% windowOptions]
  

  for(i in windowRange){ #i = min(min(windowRange):(max(windowRange)- (yearsToRoll- 1))[1])
    dataWindow <- filter(dataToAnalyze, Year %in% i:(i + yearsToRoll- 1) ) # minus 1 year for math to work
    for(k in unique(dataWindow$`Criteria Type`)){
      dataWindowCriteria <- filter(dataWindow, `Criteria Type` %in% k)
      dataWindowAnalysis <- suppressMessages( dataWindowCriteria %>% 
                                                group_by(FDT_STA_ID, #FDT_DEPTH, 
                                                         `Criteria Type`) %>%
                                                {if(aquaticLifeUse == FALSE)
                                                  summarise(., `Criteria Type` = unique(`Criteria Type`),
                                                            `Window Begin` = year(min(WindowDateTimeStart)),
                                                            `Years Analysis Rolled Over` = yearsToRoll, 
                                                            `Exceedances in Rolled Window` = sum(Exceedance),
                                                            `Valid Window` = all(`Valid Window` == T))
                                                  else summarise(., `Criteria Type` = unique(`Criteria Type`),
                                                                 `Window Begin` = year(min(WindowDateTimeStart)),
                                                                 `Years Analysis Rolled Over` = yearsToRoll, 
                                                                 `Exceedances in Rolled Window` = sum(Exceedance),
                                                                 `Valid Window` = ifelse(nrow(dataWindowCriteria) >= 2, TRUE, NA))
                                                  }   %>% 
                                                bind_cols(tibble(associatedData = list(dataWindowCriteria))) )
      dataWindowResults <- bind_rows(dataWindowResults, dataWindowAnalysis) 
      }
    }
  return(dataWindowResults)
}


# Example Usage: (demonstrating nested function syntax and pipes, both result in the same output):
# where stationData objects are already in your environment, see Automated Assessment for environment set up help
# annualRollingExceedanceAnalysis(
#   freshwaterNH3limit(stationData,
#                    trout = ifelse(unique(stationData$CLASS) %in% c('V','VI'), TRUE, FALSE),
#                    mussels = TRUE,
#                    earlyLife = TRUE),
#   yearsToRoll = 3, aquaticLifeUse = FALSE)
# freshwaterNH3limit(stationData,
#                    trout = ifelse(unique(stationData$CLASS) %in% c('V','VI'), TRUE, FALSE),
#                    mussels = TRUE,
#                    earlyLife = TRUE) %>% 
#   annualRollingExceedanceAnalysis(yearsToRoll = 3, aquaticLifeUse = FALSE)

# Freshwater Chloride Example:
# annualRollingExceedanceAnalysis(
#   chlorideFreshwaterAnalysis(stationData),
#   yearsToRoll = 3, aquaticLifeUse = TRUE) 
# chlorideFreshwaterAnalysis(stationData) %>%
#   annualRollingExceedanceAnalysis(yearsToRoll = 3, aquaticLifeUse = TRUE) 

3.5.7.3 annualRollingExceedanceSummary

The annualRollingExceedanceSummary() helper function summarizes the output of the annualRollingExceedanceAnalysis() function to suggest an ammonia result for the Wildlife and Aquatic Life designated uses. This function determines whether the number of exceeding windows is greater than the number of windows without exceedances and suggests a result based on that analysis. Before suggesting any results, the function drops any invalid data windows (based upon the validity flags applied to chronic and four day sampling criteria in previous functions). After summarizing the window results by criteria type, the function suggests a result to the user: a “Supporting” output is provided only if the number of windows without exceedances is greater than the number of windows with exceedances. All other summaries result in a “Review” flag.

# Flag if more windows exceeding than not
annualRollingExceedanceSummary <- function(rolledAnalysis){
  
  if(is.null(rolledAnalysis)){return(NULL)}
  
  # make a valid dataset for this analysis
  validDataForExceedanceSummary <- bind_rows(
    rolledAnalysis %>% 
      filter(`Criteria Type` == 'Chronic' & `Valid Window` == TRUE),
    rolledAnalysis %>% 
      filter(`Criteria Type` == 'Acute')  ) %>% 
    {if("Four Day" %in% unique(rolledAnalysis$`Criteria Type`))
      bind_rows(.,
                filter(rolledAnalysis, `Criteria Type` == "Four Day"& `Valid Window` == TRUE) )
      else . }

  
  suppressMessages(
    validDataForExceedanceSummary %>% 
    group_by(FDT_STA_ID, FDT_DEPTH, `Criteria Type`) %>% 
    summarise(`n Windows Fine` = sum(`Exceedances in Rolled Window` < 2),
              `n Windows Exceeding` = sum(`Exceedances in Rolled Window` >= 2)) %>% 
    mutate(`Suggested Result` = case_when(`n Windows Fine` > `n Windows Exceeding` ~ "Supporting",
                                          `n Windows Fine` < `n Windows Exceeding` ~ "Review",
                                          `n Windows Fine` == `n Windows Exceeding` ~ "Review",
                                          TRUE ~ as.character(NA))))
    
}

# Example Usage: (demonstrating nested function syntax and pipes, both result in the same output):
# where stationData objects are already in your environment, see Automated Assessment for environment set up help
# annualRollingExceedanceSummary(
#   annualRollingExceedanceAnalysis(
#   freshwaterNH3limit(stationData,
#                    trout = ifelse(unique(stationData$CLASS) %in% c('V','VI'), TRUE, FALSE),
#                    mussels = TRUE,
#                    earlyLife = TRUE),
#   yearsToRoll = 3, aquaticLifeUse = FALSE)
# )
# freshwaterNH3limit(stationData,
#                    trout = ifelse(unique(stationData$CLASS) %in% c('V','VI'), TRUE, FALSE),
#                    mussels = TRUE,
#                    earlyLife = TRUE) %>%
#   annualRollingExceedanceAnalysis(yearsToRoll = 3, aquaticLifeUse = FALSE) %>% 
#   annualRollingExceedanceSummary()

# Freshwater Chloride Example:
# annualRollingExceedanceSummary(
#   annualRollingExceedanceAnalysis(
#     chlorideFreshwaterAnalysis(stationData),
#     yearsToRoll = 3, aquaticLifeUse = TRUE) )
# chlorideFreshwaterAnalysis(stationData) %>%
#   annualRollingExceedanceAnalysis(yearsToRoll = 3, aquaticLifeUse = TRUE) %>%
#   annualRollingExceedanceSummary()
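The Supporting/Review decision rule can be demonstrated in base R with hypothetical window exceedance counts:

```r
# Hypothetical exceedance counts from four rolled windows
exceedancesPerWindow <- c(0, 1, 3, 0)

# A window counts as exceeding when it contains 2 or more exceedances
nWindowsFine      <- sum(exceedancesPerWindow < 2)
nWindowsExceeding <- sum(exceedancesPerWindow >= 2)

# "Supporting" only when clean windows outnumber exceeding windows;
# ties and minorities both flag "Review"
suggestedResult <- if(nWindowsFine > nWindowsExceeding) "Supporting" else "Review"
suggestedResult # "Supporting" (3 clean windows vs 1 exceeding window)
```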

3.5.7.4 rollingWindowSummary

The rollingWindowSummary() helper function summarizes the output of the annualRollingExceedanceSummary() function into a simplified output for the Station Table. The nested series of functions that analyzed the ammonia data are condensed into two columns, detailing the number of exceedances and the suggested parameter status. As of the writing of this document, the value desired in the number of exceedances column has not been determined (potentially the number of windows with exceedances or the number of exceedances across all windows), so to be conservative, this value is left as NA and requires input by assessment staff before upload into CEDS. The parameter status is summarized as follows: should a “Review” result be contained in any of the acute, chronic, or four day results, then the parameter status will be set to “Review” to signify to the assessor that further review is needed for the decision. If all criteria assessed contain “Supporting” results, then the parameter status will suggest the parameter is supporting (“S”). The parameterAbbreviation argument passes a character string of the chosen parameter name to the Station Table output column headers.

# Rolling summary for stations table
rollingWindowSummary <- function(annualRollingExceedanceSummaryOutput, parameterAbbreviation){
  
  # first kick out NULL entries (no data to analyze)
  if(is.null(annualRollingExceedanceSummaryOutput)){
    z <- tibble(`_EXC` = as.numeric(NA), `_STAT` = as.character(NA))
  } else {
    if("Review" %in% unique(annualRollingExceedanceSummaryOutput$`Suggested Result`)){
      z <- tibble(`_EXC` = as.numeric(NA), `_STAT` = 'Review')
    } else {
      z <- tibble(`_EXC` = as.numeric(NA), `_STAT` = "S") } }
  names(z) <- paste0(parameterAbbreviation, names(z))
  return(z)
}

# Example Usage: (demonstrating nested function syntax and pipes, both result in the same output):
# where stationData objects are already in your environment, see Automated Assessment for environment set up help
# rollingWindowSummary(
#   annualRollingExceedanceSummary(
#     annualRollingExceedanceAnalysis(
#       freshwaterNH3limit(stationData,
#                          trout = ifelse(unique(stationData$CLASS) %in% c('V','VI'), TRUE, FALSE),
#                          mussels = TRUE,
#                          earlyLife = TRUE),
#       yearsToRoll = 3, aquaticLifeUse = FALSE)
#     ),
#   parameterAbbreviation = "AMMONIA"
# )
# freshwaterNH3limit(stationData,
#                    trout = ifelse(unique(stationData$CLASS) %in% c('V','VI'), TRUE, FALSE),
#                    mussels = TRUE,
#                    earlyLife = TRUE) %>%
#   annualRollingExceedanceAnalysis(yearsToRoll = 3, aquaticLifeUse = FALSE) %>%
#   annualRollingExceedanceSummary() %>% 
#   rollingWindowSummary(parameterAbbreviation = "AMMONIA")


# Freshwater Chloride Example:
# rollingWindowSummary(
#         annualRollingExceedanceSummary(
#           annualRollingExceedanceAnalysis(
#             chlorideFreshwaterAnalysis(stationData),
#             yearsToRoll = 3, aquaticLifeUse = TRUE) ),
#         parameterAbbreviation = "CHL")
# chlorideFreshwaterAnalysis(stationData) %>%
#   annualRollingExceedanceAnalysis(yearsToRoll = 3, aquaticLifeUse = TRUE) %>%
#   annualRollingExceedanceSummary() %>%
#   rollingWindowSummary(parameterAbbreviation = "CHL")
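The Review propagation rule described above can be sketched in base R with a hypothetical set of criteria results:

```r
# Hypothetical Suggested Result values by criteria type
suggestedResults <- c(Acute = "Supporting", Chronic = "Review", `Four Day` = "Supporting")

# A single "Review" among the criteria types forces the parameter status to "Review";
# only unanimous "Supporting" results yield "S"
parameterStatus <- if("Review" %in% suggestedResults) "Review" else "S"
parameterStatus # "Review"
```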

3.5.8 Water Column Toxics

Since there is no place in the Station Table for individual Public Water Supply, freshwater chloride, or water column PCB criteria results, the WAT_TOX fields summarize exceedances for PWS criteria, freshwater chloride aquatic life use, and water column PCB status. The code snippet below (see Automated Assessment Analysis) outlines how these criteria results are combined into a single field in the Station Table. It is highly advised that assessment staff use this field as a flag when reviewing a station summary from the Station Table using the appropriate waterbody-specific shiny assessment application to fully understand which criteria may be providing the WAT_TOX flag.

See the Public Water Supply Criteria, Freshwater Chloride, and PCB sections below for details on individual parameter functions demonstrated in the below chunk.

# code snippet from Automated Assessment Analysis
# where stationData objects are already in your environment, see Automated Assessment for environment set up help

 # PWS Human Health Criteria
  if(nrow(stationData) > 0){
    if(is.na(unique(stationData$PWS))  ){
      PWSconcat <- tibble(PWS= NA)
   } else {
     PWSconcat <- cbind(assessPWSsummary(assessPWS(stationData, NITRATE_mg_L, LEVEL_NITRATE, 10), 'PWS_Nitrate'),
                        assessPWSsummary(assessPWS(stationData, CHLORIDE_mg_L, LEVEL_CHLORIDE, 250), 'PWS_Chloride'),
                        assessPWSsummary(assessPWS(stationData, SULFATE_TOTAL_mg_L, LEVEL_SULFATE_TOTAL, 250), 'PWS_Total_Sulfate')) %>%
       dplyr::select(-ends_with('exceedanceRate')) }
    
  # Freshwater Chloride Aquatic Life assessment, if data exists
    if(nrow(filter(stationData, !is.na(CHLORIDE_mg_L))) > 0){
      chlorideFreshwater <- rollingWindowSummary(
        annualRollingExceedanceSummary(
          annualRollingExceedanceAnalysis(chlorideFreshwaterAnalysis(stationData), yearsToRoll = 3, aquaticLifeUse = TRUE) ), "CHL")
    } else {chlorideFreshwater <- tibble(CHL_EXC = NA, CHL_STAT= NA)}
    
  # Water toxics combination with PWS, Chloride Freshwater, and water column PCB data
  if(nrow(bind_cols(PWSconcat,
                    chlorideFreshwater,
                    PCBmetalsDataExists(filter(markPCB, str_detect(SampleMedia, 'Water')) %>%
                                        filter(StationID %in% stationData$FDT_STA_ID), 'WAT_TOX')) %>%
          dplyr::select(contains(c('_EXC','_STAT'))) %>%
          mutate(across( everything(),  as.character)) %>%
          pivot_longer(cols = contains(c('_EXC','_STAT')), names_to = 'parameter', values_to = 'values', values_drop_na = TRUE) %>% 
          filter(! str_detect(values, 'WQS info missing from analysis')) %>% 
          filter(! values == "S")) >= 1) {
    WCtoxics <- tibble(WAT_TOX_EXC = NA, WAT_TOX_STAT = 'Review')
    } else { WCtoxics <- tibble(WAT_TOX_EXC = NA, WAT_TOX_STAT = NA)} 
  
  } else {
    WCtoxics <- tibble(WAT_TOX_EXC = NA, WAT_TOX_STAT = NA)
  }

3.5.9 Public Water Supply Criteria

Public Water Supply (PWS) human health criteria are assessed using the assessPWS() function and summarized for the Station Table by the assessPWSsummary() function. These functions operate across PWS parameters, using non-standard evaluation techniques to allow users to specify which parameter to evaluate and the associated criteria via function arguments. If the WQS metadata linked to the station designate it as a PWS segment, the assessPWS() function first removes all Level I and Level II data and missing data before the median of measures collected over the six year assessment period is calculated. The six year parameter median is rounded to two significant figures (rounding half to even), compared against the provided criteria, and flagged if the median exceeds the provided criteria limit.
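The median and rounding logic can be illustrated with hypothetical nitrate measures against the 10 mg/L criteria used in the example calls below:

```r
# Hypothetical nitrate measures (mg/L) over a six year assessment window
nitrate  <- c(2.1, 9.4, 11.2, 8.7, 10.6)
PWSlimit <- 10

parameterMedian  <- median(nitrate)            # 9.4
parameterRounded <- signif(parameterMedian, 2) # two significant figures per WQS
exceeds <- parameterRounded > PWSlimit

exceeds # FALSE -- the rounded median does not exceed the 10 mg/L criteria
```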

Chloride, Sulfate, Total Dissolved Solids, Iron, and Foaming Agents are secondary criteria and are only applicable to data collected at the drinking water intake. To be most protective, the automated assessment process evaluates all PWS designated segments for each of these criteria if data exist, regardless of the proximity of the station to the drinking water supply intake. It is the responsibility of the assessment staff to remove these decisions from the Station Table when they are not applicable. The waterbody-specific shiny assessment applications flag stations within 100 meters of a drinking water intake.

assessPWS <- function(stationData, fieldName, commentName, PWSlimit){
  if(unique(stationData$PWS) %in% c("Yes")){
    fieldName_ <- enquo(fieldName)
    commentName_ <- enquo(commentName)
    parameterData <- dplyr::select(stationData, FDT_DATE_TIME, !! fieldName_, !! commentName_) %>%
      filter(!( !! commentName_ %in% c('Level II', 'Level I'))) %>% # get lower levels out
      filter(!is.na(!!fieldName_ )) %>% #get rid of NA's
      mutate(`Parameter Median` = median(!! fieldName_),
             `Parameter Rounded to WQS Format` = signif(`Parameter Median`, digits = 2),  # two significant figures based on WQS https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section140/
             limit =  PWSlimit) %>%
      rename(parameter = !!names(.[5])) %>% # rename columns to make functions easier to apply
      mutate(exceeds = ifelse(parameter > limit, T, F)) # Identify where above WQS limit
    return(parameterData)
  }
  return(NULL)
}

# Example Usage:
# where stationData objects are already in your environment, see Automated Assessment for environment set up help
#assessPWS(stationData, NITRATE_mg_L, LEVEL_NITRATE, 10)
#assessPWS(stationData, CHLORIDE_mg_L, LEVEL_CHLORIDE, 250)
#assessPWS(stationData, SULFATE_TOTAL_mg_L, LEVEL_SULFATE_TOTAL, 250)

The output of the assessPWS() function becomes the input for the assessPWSsummary() function when the user desires the results to be summarized for the Station Table.

assessPWSsummary <- function(assessPWSresults, 
                             outputName){
  if(!is.null(assessPWSresults)){
    return(quickStats(assessPWSresults, outputName))  
  }
  return(quickStats(tibble(limit = NA), outputName))
}

# Example Usage:
# where stationData objects are already in your environment, see Automated Assessment for environment set up help
#assessPWSsummary(assessPWS(stationData, NITRATE_mg_L, LEVEL_NITRATE, 10), 'PWS_Nitrate')
#assessPWSsummary(assessPWS(stationData, CHLORIDE_mg_L, LEVEL_CHLORIDE, 250), 'PWS_Chloride')
#assessPWSsummary(assessPWS(stationData, SULFATE_TOTAL_mg_L, LEVEL_SULFATE_TOTAL, 250), 'PWS_Total_Sulfate')

3.5.10 Freshwater Chloride

Chloride is assessed for Aquatic Life Uses using acute and chronic criteria only in freshwater environments. The evaluation of chloride data against acute and chronic criteria is performed by the chlorideFreshwaterAnalysis() function while the summarization of results to conduct an assessment decision is performed by the nested annualRollingExceedanceAnalysis(), annualRollingExceedanceSummary(), and rollingWindowSummary() functions. These functions are detailed for ammonia, but the same logic rules apply to any parameter dataset provided to the function logic. An example of the freshwater chloride analysis and summarization for the Station Table are provided in the chunk below.

The chlorideFreshwaterAnalysis() function first ensures the station provided to the function is designated as a freshwater or transitional class before removing all Level I and Level II data and missing data. If data exist after those initial filters are applied, subsequent logic may be applied; if not, the function returns NULL. The function then creates storage objects for acute and chronic average results and loops through each row of data to correctly calculate acute criteria and find any chronic scenarios (if data exist). Each new sample event prompts the creation of acute and chronic data windows.

If sufficient data exist in any of these windows, the appropriate criteria are applied to the window chloride average, rounded to two significant figures (rounding half to even). To be most protective, chronic windows are calculated if one or more measures exist in any data window; this information is useful for assessor visualization and understanding in the application environment, but results based on only one measure are culled from assessment decisions in later functions. The data analyzed within each window are saved as a list-column called associatedData that may be extracted for investigation later. The output of the function is a long dataset where each unique sample event could have a row summarizing the acute and chronic criteria, sample count in the given window, exceedance result, and associated data (the associatedData list-column).

chlorideFreshwaterAnalysis <- function(stationData){
  # doesn't apply in class II transition zone
  stationDataCHL <- filter(stationData, CLASS %in% c('III', "IV","V","VI","VII") |
                             CLASS ==  "II" & ZONE == 'Tidal Fresh') %>% 
    filter(!(LEVEL_CHLORIDE %in% c('Level II', 'Level I'))) %>% # get lower levels out
    filter(!is.na(CHLORIDE_mg_L)) #get rid of NA's
  
  if(nrow(stationDataCHL) > 0){
    
    
    # make a place to store analysis results
    acuteCriteriaResults <- tibble(FDT_STA_ID = as.character(NA), WindowDateTimeStart = as.POSIXct(NA), FDT_DEPTH = as.numeric(NA),
                                   Value = as.numeric(NA), ValueType = as.character(NA), 
                                   `Criteria Type` = as.character(NA), CriteriaValue = as.numeric(NA), 
                                   `Sample Count` = as.numeric(NA), 
                                   parameterRound = as.numeric(NA), Exceedance = as.numeric(NA),
                                   associatedData = list())
    chronicCriteriaResults <- acuteCriteriaResults
    
    # loop through each row of data to correctly calculate criteria and find any chronic scenarios
    for(k in stationDataCHL$FDT_DATE_TIME){  #k = stationDataCHL$FDT_DATE_TIME[1]
      acuteDataWindow <- filter(stationDataCHL,  between(FDT_DATE_TIME, k, k + hours(1)))
      chronicDataWindow <- filter(stationDataCHL,  between(FDT_DATE_TIME, k, k + days(4)))
      
      # Run acute analysis if data exists
      if(nrow(acuteDataWindow) > 0){
        acuteDataCriteriaAnalysis <- suppressMessages( 
          dplyr::select(acuteDataWindow, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, CHLORIDE_mg_L) %>% 
          group_by(FDT_STA_ID, FDT_DEPTH) %>% # can't group by datetime or summary can't happen
          summarise(Value = mean(CHLORIDE_mg_L, na.rm=T),  # get hourly average
                    `Sample Count` = length(CHLORIDE_mg_L)) %>%  #count sample that made up average
          mutate(ValueType = 'Hourly Average',
                 ID = paste( FDT_STA_ID, FDT_DEPTH, sep = '_'), # make a uniqueID in case >1 sample for given datetime
                 `Criteria Type` = 'Acute', 
                 CriteriaValue = 860) %>%  # 860,000ug/L criteria to mg/L
          mutate(parameterRound = signif(Value, digits = 2), # two significant figures based on WQS https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section140/
                 Exceedance = ifelse(parameterRound > CriteriaValue, 1, 0 ), # use 1/0 to easily summarize multiple results later
                 WindowDateTimeStart = min(acuteDataWindow$FDT_DATE_TIME)) %>% 
          dplyr::select(FDT_STA_ID, WindowDateTimeStart, everything(), -ID) ) %>% 
          bind_cols(tibble(associatedData = list(dplyr::select(acuteDataWindow, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, CHLORIDE_mg_L) ) ))
        # Save the results for viewing later
        acuteCriteriaResults <- bind_rows(acuteCriteriaResults, acuteDataCriteriaAnalysis) 
      } else {acuteCriteriaResults <- acuteCriteriaResults }
      
      # Run chronic analysis if data exists
      if(nrow(chronicDataWindow) > 0){
        chronicDataCriteriaAnalysis <- suppressMessages( 
          dplyr::select(chronicDataWindow, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, CHLORIDE_mg_L) %>% 
          group_by(FDT_STA_ID, FDT_DEPTH) %>% # can't group by datetime or summary can't happen
          summarise(Value = mean(CHLORIDE_mg_L, na.rm=T),  # get four day average
                    `Sample Count` = length(CHLORIDE_mg_L)) %>%  #count sample that made up average
          mutate(ValueType = 'Four Day Average',
                 ID = paste( FDT_STA_ID, FDT_DEPTH, sep = '_'), # make a uniqueID in case >1 sample for given datetime
                 `Criteria Type` = 'Chronic', 
                 CriteriaValue = 230) %>%  # 230,000ug/L criteria to mg/L
          mutate(parameterRound = signif(Value, digits = 2), # two significant figures based on WQS https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section140/
                 Exceedance = ifelse(parameterRound > CriteriaValue, 1, 0 ), # use 1/0 to easily summarize multiple results later
                 WindowDateTimeStart = min(chronicDataWindow$FDT_DATE_TIME)) %>% 
          dplyr::select(FDT_STA_ID, WindowDateTimeStart, everything(), -ID) ) %>% 
          bind_cols(tibble(associatedData = list(dplyr::select(chronicDataWindow, FDT_STA_ID, FDT_DATE_TIME, FDT_DEPTH, CHLORIDE_mg_L) ) ))
        # Save the results for viewing later
        chronicCriteriaResults <- bind_rows(chronicCriteriaResults, chronicDataCriteriaAnalysis) 
      } else {chronicCriteriaResults <- chronicCriteriaResults }
    }
    # summarize results
    stationCriteriaResults <- bind_rows( acuteCriteriaResults, chronicCriteriaResults) %>% 
      filter(!is.na(FDT_STA_ID)) %>% # drop placeholder rows
      distinct(FDT_STA_ID, WindowDateTimeStart, FDT_DEPTH, `Criteria Type`, .keep_all = T) %>% # remove duplicates in case > 1 depth per datetime
      arrange(FDT_STA_ID, WindowDateTimeStart, FDT_DEPTH, `Criteria Type`)
    return(stationCriteriaResults)
  } else { return(NULL)}
}
# Example Usage:
# where stationData objects are already in your environment, see Automated Assessment for environment set up help
# chlorideFreshwaterAnalysis(stationData) 
# Example Usage for Station Table (demonstrating nested function syntax and pipes, both result in the same output):
# rollingWindowSummary(
#         annualRollingExceedanceSummary(
#           annualRollingExceedanceAnalysis(
#             chlorideFreshwaterAnalysis(stationData), 
#             yearsToRoll = 3, aquaticLifeUse = TRUE) ),  
#         parameterAbbreviation = "CHL")
# chlorideFreshwaterAnalysis(stationData) %>% 
#   annualRollingExceedanceAnalysis(yearsToRoll = 3, aquaticLifeUse = FALSE) %>% 
#   annualRollingExceedanceSummary() %>% 
#   rollingWindowSummary(parameterAbbreviation = "CHL")
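The hourly averaging and round half to even behavior can be demonstrated with hypothetical chloride measures collected within a single acute (one hour) window:

```r
# Hypothetical chloride measures (mg/L) within one hourly acute window
hourlyWindow <- c(870, 845, 880)

acuteCriterion <- 860                       # 860,000 ug/L acute criterion expressed as mg/L
hourlyAverage  <- mean(hourlyWindow)        # 865
parameterRound <- signif(hourlyAverage, 2)  # 860: signif() rounds the half value down to even
exceedance     <- ifelse(parameterRound > acuteCriterion, 1, 0)
exceedance # 0 -- the rounded average does not exceed the acute criterion
```

Note that the half value 865 rounds down to 860 under round half to even, so this window is not flagged even though the raw average sits above the 860 mg/L criterion.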

3.5.11 PCB

The PCBmetalsDataExists() function is used primarily for flagging whether or not certain data exist when building the Station Table. Most stations evaluated during an assessment period do not have metals or toxics data, so it is important to flag when data require review. This particular function can be used to flag when data exist across water column PCB, sediment PCB, fish tissue metals, and fish tissue PCB data. When data exists for these toxics parameters, regional assessment staff may investigate them against applicable thresholds inside the waterbody-specific shiny assessment application.

## Identify if further review needs to happen for PCB or metals data 
PCBmetalsDataExists <- function(datasetType, # any of the preorganized PCB or fish tissue datasets
                                parameterType # field name to pass through to Station Table output
                                ){
  # if any data given to function
  if(nrow(datasetType) > 0){ 
    x <- data.frame(EXC = NA, STAT = 'Review')
  }else {
    x <- data.frame(EXC = NA, STAT = NA) }
  
  names(x) <- paste(parameterType, names(x), sep='_')
  return(x)
}

# Example Usage:
# where markPCB, fishMetals, fishPCB objects are already in your environment, see Automated Assessment for environment set up help
# PCBmetalsDataExists(filter(markPCB, str_detect(SampleMedia, 'Water')) %>%
#                       filter(StationID %in% '2-JKS023.61'), 'WAT_TOX')
# PCBmetalsDataExists(filter(markPCB, str_detect(SampleMedia, 'Sediment')) %>%
#                       filter(StationID %in% '2-JKS023.61'), 'SED_TOX')
# PCBmetalsDataExists(filter(fishMetals, Station_ID %in% '2-JKS023.61'), 'FISH_MET')
# PCBmetalsDataExists(filter(fishPCB, `DEQ rivermile` %in%  '2-JKS023.61'), 'FISH_TOX')

3.5.12 Metals

For a general flag provided to the Station Table to determine whether metals data do/do not exist, please see PCB above. When data exists for these sediment or water column metals parameters, regional assessment staff may investigate them against applicable thresholds inside the waterbody-specific shiny assessment application.

This section will be completed in a forthcoming update.

3.5.13 Fish Tissue

For a general flag provided to the Station Table to determine whether fish tissue data do/do not exist, please see PCB above. When data exists for these fish tissue metals or PCB data, regional assessment staff may investigate them against applicable thresholds inside the waterbody-specific shiny assessment application.

3.5.14 Benthics

The benthicAssessment() function is used primarily for flagging whether or not benthic data exist when building the Station Table. Most stations evaluated during an assessment period do not have benthic macroinvertebrate data, so it is important to flag when data require review.

Benthic assessments are completed by regional biologists to ensure the appropriate SCI and professional judgment are used for assessments. The biological staff review all samples within a given assessment period using the Bioassessment Dashboard and complete a bioassessment for submission via the Bioassessment Fact Sheet Generator. This information is archived for future biological, assessment, and TMDL staff as well as provided seamlessly within the Riverine Assessment Application.

The regional assessment staff know to look for bioassessment information based on the ‘Review’ flag that the benthicAssessment() function provides to the Station Table output.

# Benthic Data flag
benthicAssessment <- function(stationData,
                              VSCIresults # pinned dataset with all BenSamps run against just VSCI
                              ){
  # this works because the all SCI options are run on all data, so if there is a VSCI result 
  # (even if in real life that is not the correct SCI to use), then benthic data exists for
  # a given station
  benthicDataExist <- filter(VSCIresults, StationID %in% unique(stationData$FDT_STA_ID))
  if(nrow(benthicDataExist) > 0){tibble(BENTHIC_STAT = 'Review')
  } else { tibble(BENTHIC_STAT = NA)}
}

# Example Usage:
# where stationData and VSCIresults objects are already in your environment, see Automated Assessment for environment set up help
# benthicAssessment(stationData, VSCIresults)

3.5.15 Additional Functions

The functions below are helper functions that do not take a starring role in the automated assessment process; however, without them, one could not perform an automated assessment. Each function is detailed below.

3.5.15.1 assessmentPeriod

The assessmentPeriod and assessmentCycle objects are not functions but are incredibly useful vectors. These objects hold assessment window and cycle name information, allowing for significantly easier updates from cycle to cycle. Because this information is sourced by many dependent arguments and functions, one can make cycle changes in a single place when a new cycle is ready for assessment.

# Establish Assessment Period, update each cycle
assessmentPeriod <- as.POSIXct( c("2017-01-01 00:00:00 UTC", "2022-12-31 23:59:59 UTC"), tz='UTC')
assessmentCycle <- '2024'

# Example Usage:
# pin_get("ejones/VSCIresults", board = "rsconnect") %>%
#   filter( between(`Collection Date`, assessmentPeriod[1], assessmentPeriod[2]) )

3.5.15.2 WQSvalues

The WQSvalues object is a tibble that allows for easy WQS updates should these values change over time. These data are joined to individual station data by WQS Class information to easily link measured parameters to the appropriate criteria. Follow-up functions refine the criteria information should a station’s WQS specify any special standards.

# WQS information for functions
# From: 9VAC25-260-50. Numerical Criteria for Dissolved Oxygen, Ph, and Maximum Temperature
# https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section50/
WQSvalues <- tibble(CLASS_BASIN = c('I',"II","II_7","III","IV","V","VI","VII"),
                    CLASS = c('I',"II","II","III","IV","V","VI","VII"),
                    `Description Of Waters` = c('Open Ocean', 'Tidal Waters in the Chowan Basin and the Atlantic Ocean Basin',
                                                'Tidal Waters in the Chesapeake Bay and its tidal tributaries',
                                                'Nontidal Waters (Coastal and Piedmont Zone)','Mountainous Zone Waters',
                                                'Stockable Trout Waters','Natural Trout Waters','Swamp Waters'),
                    `Dissolved Oxygen Min (mg/L)` = c(5,4,NA,4,4,5,6,NA),
                    `Dissolved Oxygen Daily Avg (mg/L)` = c(NA,5,NA,5,5,6,7,NA),
                    `pH Min` = c(6,6,6.0,6.0,6.0,6.0,6.0,3.7),
                    `pH Max` = c(9.0,9.0,9.0,9.0,9.0,9.0,9.0,8.0),
                    `Max Temperature (C)` = c(NA, NA, NA, 32, 31, 21, 20, NA)) %>%
  mutate(CLASS_DESCRIPTION = paste0(CLASS, " | ", `Description Of Waters`))

# Example Usage:
# where stationTable is already joined to citmonWQS and WQSlookup, see Automated Assessment for environment set up help
# stationTable %>%
#   # Fix for Class II Tidal Waters in Chesapeake (bc complicated DO/temp/etc standard)
#   mutate(CLASS_BASIN = paste(CLASS,substr(BASIN, 1,1), sep="_")) %>% # (2)
#   mutate(CLASS_BASIN = ifelse(CLASS_BASIN == 'II_7', "II_7", as.character(CLASS))) %>% # (2)
#   # Join actual WQS criteria to each StationID
#   left_join(WQSvalues, by = 'CLASS_BASIN')

3.5.15.3 lakeNameStandardization

This helper function standardizes lake names to allow joins on lake name fields between disparate data types. Because this crosswalk information is stored in a single function, it can be updated once if lake names change and applied across multiple scripts.

# Lake name standardization
lakeNameStandardization <- function(x){
  x %>% 
    mutate(Lake_Name = case_when(WATER_NAME %in% c('Abel Lake Reservoir (Long Branch)') ~ 'Abel Lake',
                                 WATER_NAME %in% c('Big Cherry Reservior') ~ 'Big Cherry Lake',
                                 WATER_NAME %in% c('Dan River','Buffalo Creek','Bluestone Creek') ~ 'Kerr Reservoir',
                                 WATER_NAME %in% c('Smith Lake (Aquia Reservoir)') ~ 'Aquia Reservoir (Smith Lake)',
                                 WATER_NAME %in% c('Claytor Lake (New River)', 'Claytor Lake (Peak Creek)',
                                                   'Claytor Lake Lower (New River)') ~ 'Claytor Lake',
                                 WATER_NAME %in% c('Fairystone Lake (Goblin Town Creek)') ~ 'Fairystone Lake', 
                                 WATER_NAME %in% c('Harwood Mill Reservoir (PWS)') ~ 'Harwood Mills Reservoir',
                                 WATER_NAME %in% c('Goose Creek') ~ 'Goose Creek Reservoir',
                                 WATER_NAME %in% c('Lake Anna', 'Lake Anna/Contrary Creek', 'Lake Anna/Freshwater Creek', 
                                                   'Lake Anna/Gold Mine Creek', 'Lake Anna/Pamunkey Creek', 
                                                   'Lake Anna/Plentiful Creek', 'Terrys Run/Lake Anna') ~ 'Lake Anna',
                                 WATER_NAME %in% c('Lake Cohoon (PWS)') ~ 'Lake Cohoon',     
                                 WATER_NAME %in% c('Conner Lake') ~ 'Lake Conner',
                                 WATER_NAME %in% c('Lake Kilby (PWS)') ~ 'Lake Kilby',
                                 WATER_NAME %in% c('Lake Meade (PWS)','Pitch Kettle Creek - Lake (PWS)') ~ 'Lake Meade',
                                 WATER_NAME %in% c('Lake Moomaw (Jackson River)','Lake Moomaw Middle (Jackson River)') ~ 'Lake Moomaw',
                                 WATER_NAME %in% c('Lake Prince - Reservoir (PWS)') ~ 'Lake Prince',
                                 WATER_NAME %in% c('Lake Shenandoah') ~ 'Shenandoah Lake',
                                 WATER_NAME %in% c('Lake Smith (PWS)') ~ 'Lake Smith',
                                 WATER_NAME %in% c('Lake Whitehurst (PWS)') ~ 'Lake Whitehurst',
                                 WATER_NAME %in% c('Leesville Lake', 'Leesville Lake (Pigg R.)', 
                                                   'Leesville Lake Middle (Roanoke R.)') ~ 'Leesville Lake',
                                 WATER_NAME %in% c('Lee Hall Reservoir- Upper, Middle','Lee Hall Reservoir-Lower') ~ 'Lee Hall Reservoir',
                                 WATER_NAME %in% c('Little Creek Reservoir - (PWS)') ~ 'Little Creek Reservoir (VBC)',
                                 WATER_NAME %in% c('Little Creek Reservoir (PWS)') ~ 'Little Creek Reservoir (JCC)',
                                 WATER_NAME %in% c('Lonestar Lake F (PWS)') ~ 'Lone Star Lake F (Crystal Lake)', 
                                 WATER_NAME %in% c('Lone Star Lake G (PWS)') ~ 'Lone Star Lake G (Crane Lake)', 
                                 WATER_NAME %in% c('Lone Star Lake I (PWS)') ~ 'Lone Star Lake I (Butler Lake)', 
                                 WATER_NAME %in% c('Martinsville (Beaver Creek) Reservoir',
                                                   'Martinsville Reservoir') ~ 'Martinsville Reservoir (Beaver Creek Reservoir)',
                                 WATER_NAME %in% c('Moormans River') ~ 'Sugar Hollow Reservoir',
                                 WATER_NAME %in% c('North River') ~ 'Staunton Dam Lake',
                                 WATER_NAME %in% c('Philpott Reservoir (Goblin Town Creek)', 'Philpott Reservoir Lower (Smith River)',
                                                   'Philpott Reservoir (Smith River)') ~ "Philpott Reservoir",
                                 WATER_NAME %in% c('Roanoke River','Lake Gaston') ~ 'Lake Gaston',     
                                 WATER_NAME %in% c('Roaring Fork Reservoir') ~ 'Roaring Fork',
                                 WATER_NAME %in% c('S F Rivanna River Reservoir') ~ 'Rivanna Reservoir (South Fork Rivanna Reservoir)',
                                 str_detect(WATER_NAME, 'Smith Mtn. Lake') ~ 'Smith Mountain Lake', 
                                 WATER_NAME %in% c('Speights Run - Lake (PWS)') ~ 'Speights Run Lake',
                                 WATER_NAME %in% c('Troublesome Reservoir') ~ 'Troublesome Creek Reservoir',
                                 WATER_NAME %in% c('Waller Mill Reservoir [PWS]') ~ 'Waller Mill Reservoir',
                                 WATER_NAME %in% c('Unnamed pond near Tanyard Swamp') ~ 'Tanyard Swamp',
                                 WATER_NAME %in% c('Unsegmented lakes in G03') ~ 'West Run',
                                 TRUE ~ as.character(WATER_NAME))) 
}

# Example Usage:
# where stationTable is already joined to citmonWQS, WQSlookup, WQSvalues, and WQMstationFull, see Automated Assessment for environment set up help
# stationTable %>% 
#   lakeNameStandardization()  
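As a self-contained illustration, the crosswalk behaves as sketched below. The WATER_NAME values are hypothetical inputs chosen to exercise three branches (an exact match, the Smith Mtn. Lake pattern match, and the pass-through default); they are not guaranteed to appear in CEDS.

```r
# Minimal sketch, assuming dplyr and stringr are loaded as in the assessment environment
library(dplyr)
library(stringr)

toy <- tibble(WATER_NAME = c('Conner Lake',                   # exact match branch
                             'Smith Mtn. Lake (Blackwater)',  # str_detect() pattern branch
                             'Some Unlisted Pond'))           # hypothetical, falls through unchanged
lakeNameStandardization(toy)
# Lake_Name becomes 'Lake Conner', 'Smith Mountain Lake', and 'Some Unlisted Pond'
```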

3.5.15.4 quickStats

This helper function is essential to the automated assessment process. The function takes in parameter data with exceedances identified, counts samples and exceedances, and simplifies the results into a suggested assessment decision where appropriate. The parameter argument allows users to pass a character string they want included in the Stations Table field naming format. The drop7Q10 argument defaults to FALSE for automated assessment purposes, but the shiny applications use this feature to quickly rerun the exceedance math should a user decide that the 7Q10 flag should be used to remove data for a station and parameter combination.

quickStats <- function(parameterDataset, # dataset from one station that has been run through a parameter exceedance analysis function 
                       #                  (e.g. tempExceedances(), DOExceedances_Min(), pHexceedances())
                       parameter, # the name of the parameter as you want it to appear in final stations table output
                       drop7Q10 = FALSE # drop any records from this analysis with a 7Q10 flag, default is false but can be overridden in apps
                       ){
  # Drop any 7Q10 flagged data from exceedance analyses for appropriate parameters
  if(drop7Q10 == TRUE &
     parameter %in% c("TEMP", "DO", "DO_Daily_Avg", "PH")){
    parameterDataset <- dplyr::filter(parameterDataset, is.na(`7Q10 Flag`)) # only assess on assessable dataset
  }
  
  if(nrow(parameterDataset) > 0 & any(!is.na(parameterDataset$limit))){
    results <- data.frame(EXC = nrow(dplyr::filter(parameterDataset, exceeds == TRUE)),
                          SAMP = nrow(parameterDataset)) %>%
      # Implement Round to Even on Exceedance Frequency
      dplyr::mutate(exceedanceRate = as.numeric(round::roundAll((EXC/SAMP)*100,digits=0, "r0.C"))) # round to nearest whole number per Memo to Standardize Rounding for Assessment Guidance
    
    if(results$EXC >= 1){outcome <- 'Review'} # for Mary
    if(results$EXC >= 1 & results$exceedanceRate < 10.5){outcome <- 'Review'}
    if(results$exceedanceRate > 10.5 & results$EXC >= 2 & results$SAMP > 10){outcome <- '10.5% Exceedance'}
    if(results$EXC < 1 & results$exceedanceRate < 10.5 & results$SAMP > 10){outcome <- 'S'}
    if(results$EXC >= 1 & results$SAMP <= 10){outcome <- 'Review'}
    if(results$EXC < 1 & results$SAMP <= 10 & results$SAMP > 1){outcome <- 'S'} # new 12/21/2020: can't call a station supporting on a single sample
    if(results$EXC < 1 & results$SAMP == 1){outcome <- 'Review'} # a single clean sample still requires manual review
    
    
    results <- dplyr::mutate(results, STAT = outcome)
    # rename all columns based on parameter entered
    names(results) <- paste(parameter, names(results), sep = '_')
    return(results)
  } else {
    if(nrow(parameterDataset) == 0){
      z <- data.frame(EXC = NA, SAMP= nrow(parameterDataset), exceedanceRate= NA, STAT= NA)
      names(z) <- paste(parameter,names(z), sep='_')
      return(z)
    } else {
      z <- data.frame(EXC = NA, SAMP= nrow(parameterDataset), exceedanceRate= NA, STAT= paste(parameter, 'WQS info missing from analysis'))
      names(z) <- paste(parameter,names(z), sep='_')
      return(z)
    }
    
  }
}

# Example Usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
# tempExceedances(stationData) %>% 
#   quickStats('TEMP')
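For a self-contained sketch of the decision logic, the hypothetical dataset below supplies the exceeds and limit columns that an exceedance function would normally produce (the StationID and temperature values are fabricated; the round package must be installed for the round-to-even step):

```r
library(dplyr)

toyData <- tibble(FDT_STA_ID = '2-XXX000.00',  # hypothetical StationID
                  limit = 32,                  # e.g. Class III max temperature criterion
                  FDT_TEMP_CELCIUS = c(20, 25, 33, 30, 28, 22, 31, 29, 26, 24, 27, 21),
                  exceeds = FDT_TEMP_CELCIUS > limit)
quickStats(toyData, 'TEMP')
# 1 exceedance in 12 samples is an ~8% exceedance rate, so TEMP_STAT is 'Review'
```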

3.5.15.5 StationTableStartingData

This helper function organizes existing site metadata into a format that matches the desired Station Table output format.

StationTableStartingData <- function(stationData){
  stationData %>%
    dplyr::select(FDT_STA_ID, ID305B_1:VAHU6) %>%
    dplyr::rename('STATION_ID' = 'FDT_STA_ID') %>%
    dplyr::distinct(STATION_ID, .keep_all = T)
}

# Example Usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
# StationTableStartingData(stationData)

3.5.15.6 temperatureSpecialStandardsCorrection

This helper function is applied after the WQSvalues object has been joined to a tibble of station data. The function refines the coarser criteria joined by Class by identifying whether any temperature special standards are attributed to a station. If so, the function adjusts each Max Temperature (C) record according to the applicable temporal criteria.

# temperature special standards adjustment function
# this is applied to the actual monitoring data because there is a temporal component to these criteria
temperatureSpecialStandardsCorrection <- function(x){
  # ee. Maximum temperature for these seasonally stockable trout waters is 26°C and applies May 1 through October 31. https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section310/
  # ff. Maximum temperature for these seasonally stockable trout waters is 28°C and applies May 1 through October 31. https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section310/
  # hh. Maximum temperature for these seasonally stockable trout waters is 31°C and applies May 1 through October 31. https://law.lis.virginia.gov/admincode/title9/agency25/chapter260/section310/
  mutate(x, `Max Temperature (C)` = case_when(str_detect(as.character(SPSTDS), 'ee') & month(FDT_DATE_TIME) %in% 5:10 ~ 26,
                                              str_detect(as.character(SPSTDS), 'ff') & month(FDT_DATE_TIME) %in% 5:10 ~ 28,
                                              str_detect(as.character(SPSTDS), 'hh') & month(FDT_DATE_TIME) %in% 5:10 ~ 31,
                                              TRUE ~ `Max Temperature (C)`)) 
}

# Example Usage:
# where conventionals and stationTable objects are already in your environment, see Automated Assessment for environment set up help
# stationData <- filter(conventionals, FDT_STA_ID %in% '2-JKS023.61') %>%
#     left_join(stationTable, by = c('FDT_STA_ID' = 'STATION_ID')) %>%
#     temperatureSpecialStandardsCorrection()
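A minimal sketch of the seasonal adjustment using hypothetical records (assumes dplyr, stringr, and lubridate are loaded, as in the assessment environment):

```r
library(dplyr)
library(stringr)
library(lubridate)

toy <- tibble(SPSTDS = c('ee', 'ee', NA),  # hypothetical special standards designations
              FDT_DATE_TIME = as.POSIXct(c('2022-07-15 10:00', '2022-01-15 10:00', '2022-07-15 10:00')),
              `Max Temperature (C)` = c(21, 21, 32))
temperatureSpecialStandardsCorrection(toy)
# the July 'ee' record is adjusted to 26; the January record keeps 21 because the
# seasonal criterion only applies May 1 through October 31; the NA record is untouched
```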

3.5.15.7 pHSpecialStandardsCorrection

This helper function is applied after the WQSvalues object has been joined to a tibble of station data. The function refines the coarser criteria joined by Class by identifying whether any pH special standards are attributed to a station. If so, the function adjusts each pH Min and pH Max record according to the applicable criteria.

pHSpecialStandardsCorrection <- function(stationData){
  mutate(stationData, `pH Min` = case_when(str_detect(as.character(SPSTDS), '6.5-9.5') ~ 6.5, TRUE ~ `pH Min`),
         `pH Max` = case_when(str_detect(as.character(SPSTDS), '6.5-9.5') ~ 9.5, TRUE ~ `pH Max`))
}
# Example Usage:
# where conventionals and stationTable objects are already in your environment, see Automated Assessment for environment set up help
# stationData <- filter(conventionals, FDT_STA_ID %in% '2-JKS023.61') %>%
#     left_join(stationTable, by = c('FDT_STA_ID' = 'STATION_ID')) %>%
#     pHSpecialStandardsCorrection()
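A quick hypothetical illustration (assumes dplyr and stringr are loaded; the SPSTDS values are fabricated):

```r
library(dplyr)
library(stringr)

toy <- tibble(SPSTDS = c('6.5-9.5', NA),
              `pH Min` = c(6.0, 6.0),
              `pH Max` = c(9.0, 9.0))
pHSpecialStandardsCorrection(toy)
# the first record is adjusted to the 6.5-9.5 special standard;
# the second keeps the class default 6.0-9.0
```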

3.5.15.8 stationTableComments

This helper function parses station level comments from the two most recent IR Station Table outputs. This information is joined into the current IR Station Table, despite not fitting with the CEDS Bulk Data Upload template, in order to provide regional assessment staff with historical station information adjacent to current station information.

stationTableComments <- function(stations, 
                                 previousStationTable, 
                                 previousStationTableCycle,
                                 previousStationTable2,
                                 previousStationTable2Cycle){
  lastComment <- filter(previousStationTable, `Station Id` %in% stations) %>%
    dplyr::select(`Station Id`, Comments)
  names(lastComment) <- c('STATION_ID', paste(previousStationTableCycle, 'IR COMMENTS'))
  lastComment2 <- filter(previousStationTable2, `Station Id` %in% stations) %>%
    dplyr::select(`Station Id`, Comments)
  names(lastComment2) <- c('STATION_ID', paste(previousStationTable2Cycle, 'IR COMMENTS'))
  return(left_join(lastComment, lastComment2, by= 'STATION_ID'))
}

# Example Usage:
# where stationsTable2022 and stationTable2020 are objects containing previous stations table spreadsheets read into your R environment, see Automated Assessment for environment set up help
# stationTableComments('2-JKS023.61', stationsTable2022, '2022', stationsTable2020, '2020')
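A self-contained sketch with hypothetical previous stations tables (in practice these tables are read in from the previous cycles’ spreadsheets, as in the example above):

```r
library(dplyr)

prev2022 <- tibble(`Station Id` = c('2-JKS023.61', '2-XXX000.00'),  # second StationID is hypothetical
                   Comments = c('Hypothetical 2022 note', 'n/a'))
prev2020 <- tibble(`Station Id` = '2-JKS023.61',
                   Comments = 'Hypothetical 2020 note')
stationTableComments('2-JKS023.61', prev2022, '2022', prev2020, '2020')
# returns one row with STATION_ID, `2022 IR COMMENTS`, and `2020 IR COMMENTS` columns
```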

3.5.15.9 thermoclineDepth

This helper function is essential for any lake station. The function takes in all conventionals data for a particular station and calculates a thermocline depth for each sample date. This depth is then used to attribute every sample as either “Epilimnion,” “Hypolimnion,” or unstratified (NA).

# Calculate daily thermocline depth and designate Epilimnion vs Hypolimnion
thermoclineDepth <- function(stationData){
  stationData <- stationData %>%
    mutate(SampleDate = as.Date(FDT_DATE_TIME)) %>%
    group_by(FDT_STA_ID, SampleDate) 
  
  dailyThermDepth <- dplyr::select(stationData, FDT_STA_ID, SampleDate, FDT_DEPTH, FDT_TEMP_CELCIUS) %>%
    mutate(DepthDiff = c(NA, diff(FDT_DEPTH)),
           TempDiff = c(NA, diff(FDT_TEMP_CELCIUS))) %>%
    filter(DepthDiff == 1) # keep only records collected at exact 1 meter depth increments
  # Alt route in case shallow lake
  if(nrow(dailyThermDepth) > 0){
    dailyThermDepth <- filter(dailyThermDepth, TempDiff <= -1)
    # one more catch if no thermocline established
    if(nrow(dailyThermDepth) > 0){
      dailyThermDepth <- summarise(dailyThermDepth, ThermoclineDepth = min(FDT_DEPTH) - 0.5) %>% ungroup() 
    } else {
      dailyThermDepth <- summarise(stationData, ThermoclineDepth = NA) %>% ungroup()  }
  } else {
    dailyThermDepth <- summarise(stationData, ThermoclineDepth = NA) %>% ungroup() }
    
  
  full_join(stationData, dailyThermDepth, by = c('FDT_STA_ID', 'SampleDate')) %>%
    mutate(LakeStratification= ifelse(FDT_DEPTH < ThermoclineDepth,"Epilimnion","Hypolimnion"))%>% ungroup() 
}

# Example usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
# stationData %>% thermoclineDepth()
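A minimal hypothetical profile shows how the thermocline is identified (assumes dplyr is loaded; the station and temperatures are fabricated):

```r
library(dplyr)

toy <- tibble(FDT_STA_ID = '2-XXX000.00',  # hypothetical lake station
              FDT_DATE_TIME = as.POSIXct('2022-07-15 10:00'),
              FDT_DEPTH = 0:6,             # 1 m depth increments
              FDT_TEMP_CELCIUS = c(25, 25, 24.5, 22, 20, 19, 18.5))
thermoclineDepth(toy)
# the first >= 1 C drop over 1 m occurs at 3 m, so ThermoclineDepth is 2.5 m;
# depths 0-2 m are labeled Epilimnion and depths 3-6 m Hypolimnion
```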

3.5.15.10 lowFlowFlagColumn

This helper function is new for IR2024. The function adds a flag field to the Station Table output only if a station has temperature, DO, or pH data collected on a day when a gage in the subbasin was below its 7Q10 flow.

# Function to add a column to stations table output that summarizes 7Q10 information for relevant parameters
lowFlowFlagColumn <- function(stationData){
  lowFlowFlag <- filter(stationData, !is.na(`7Q10 Flag`))
  if(nrow(lowFlowFlag) > 0){
    withParameterData <- dplyr::select(lowFlowFlag, FDT_STA_ID, FDT_DATE_TIME, 
                                       FDT_TEMP_CELCIUS, DO_mg_L, FDT_FIELD_PH, `7Q10 Flag Gage`) %>% 
      pivot_longer(-c(FDT_STA_ID, FDT_DATE_TIME, `7Q10 Flag Gage`), names_to = 'parameter', values_to = 'value') %>% 
      filter(!is.na(value)) # drop rows with no data for specific parameters
    if(nrow(withParameterData) > 0){
      gageFlags <- dplyr::select(withParameterData, FDT_STA_ID, FDT_DATE_TIME,  `7Q10 Flag Gage`) %>% 
        distinct(FDT_DATE_TIME, .keep_all = T) %>% 
        summarise(STATION_ID = FDT_STA_ID, 
                  `7Q10 Flag` = paste(as.Date(FDT_DATE_TIME),  `7Q10 Flag Gage`, collapse = '; ')) %>% 
        mutate(`7Q10 Flag` = paste('USGS Gages in subbasin below 7Q10: ', `7Q10 Flag`)) 
    } else {
      gageFlags <- tibble(STATION_ID = unique(lowFlowFlag$FDT_STA_ID),
                          `7Q10 Flag` = NA_character_)   }
  } else {
    # no 7Q10 flagged records exist for this station, so return an empty flag row
    gageFlags <- tibble(STATION_ID = unique(stationData$FDT_STA_ID),
                        `7Q10 Flag` = NA_character_)
    }
  return(gageFlags)
}

# Example usage:
# where stationData is already in your environment, see Automated Assessment for environment set up help
# lowFlowFlagColumn(stationData)

3.6 Automated Output

This dataset is the end result of the automated assessment process. It is saved as a .csv and used for upload to both the riverine and lacustrine applications. The reason the data are stored in .csv and not on the server is to enable individual assessors to adjust AU information manually whenever needed (e.g. if an AU needs to be split). If this data were stored on the server, then the Assessment Data Analyst would need to be involved any time an assessor needed to adjust which AU(s) a station is connected to. This data is available for download from each of the riverine and lacustrine assessment applications.

4 Regional Metadata Validation
Tool How To

The Regional Metadata Validation Tool is located on the R server.

The purpose of this tool is to streamline the QA of metadata that automated scripts attached to individual stations. These metadata include Water Quality Standards (WQS) and Assessment Unit (AU) information that are necessary to properly analyze and organize assessment results. A scripted process automates the data organization of stations from disparate data sources, spatially joins these sites to the appropriate WQS and AUs where information is not available from previous cycles, and runs basic QA on the output to ensure all stations that were expected to receive metadata did in fact get attributed. These scripts are detailed in the Metadata Attribution section.

It is the responsibility of each regional assessor to review the metadata linked through the automated snapping process to ensure the closest snapped information does in fact apply to the given station. This application offers an opportunity to perform QA on the spatial WQS and AU layers and communicate with Central Office Assessment/WQS staff when mistakes are found.

4.1 Application Orientation

This application expedites the review of the spatially joined output by organizing stations into the appropriate waterbody type, Assessment Region, and subbasin. The application is divided into two tabs (corresponding to AU review and WQS review) that can be navigated between using the navigation bar at the top of the page. The application opens to the Assessment Unit Review tab by default, but there is no specified order in which a reviewer must interact with either tab. Data for each tab are saved back to the R server independently. Each of these tabs has a drop down option that is accessible by clicking the triangle adjacent to the name of the tab. These options are identical for both the Assessment Unit QA and Water Quality Standards QA tabs. The top option called Watershed Selection allows users to choose certain types of stations to review and the bottom option called Manual Review allows users to visualize the stations selected in the Watershed Selection tab option in more detail. The final tab called “About” offers instructions to the user similar to what is detailed below.

The following application workflow is applicable to both the Assessment Unit QA and Water Quality Standards QA sections of the application. For brevity, this user guide will only go over instructions for the WQS side of the application, but the application flow applied to the WQS side of the application mimics the application flow applied on the AU side of the application.

4.2 Application Use:
Water Quality Standards QA-
Watershed Selection Tab

On load, the tab will provide three related drop down menus on the left hand side of the page. These interact in a top-down hierarchical system where the selection in the top drop down, Select Waterbody Type, dictates available options in the middle drop down, Select DEQ Assessment Region, which then dictates the available options in the bottom drop down, Select Subbasin. Users navigate through these cascading filters to identify the precise waterbody type (e.g. riverine, lacustrine, or estuarine), assessment region, and subbasin to tackle QA in a given session.

Once the user is happy with their selection, they must click the Begin Review With Subbasin Selection (Retrieves Last Saved Result) button. This button uses the user provided information to retrieve all the applicable stations that require QA. This information is sourced from the R server in addition to all the applicable spatial layers necessary to produce meaningful maps in the subsequent Manual Review tab option. Once this information is retrieved from the R server, a basic interactive map of the selected subbasin populates the tab as well as verbal breakdowns of the type of information that needs to be reviewed for the selected subbasin (Preprocessing Data Recap for Selected Region/Subbasin/Type Combination).

Based on the summarized information about the stations in the selected subbasin, the user may choose to begin reviewing that subbasin or they might scroll through other waterbody type/region/subbasin combinations until a more desirable “to do list” is identified. Each time a user changes any of the drop down inputs, they must press the Begin Review With Subbasin Selection (Retrieves Last Saved Result) button to retrieve the associated station breakdown from the chosen subbasin. A small progress bar appears in the lower right corner of the browser window to communicate that large spatial data are being transmitted to the app. The popup disappears when the map re-renders and loads the new spatial information.

Once a user is ready to review the stations in the selected subbasin, the user must navigate to the Manual Review tab under the Water Quality Standards QA drop down on the navigation bar.

4.3 Application Use:
Water Quality Standards QA-
Manual Review Tab

On load, the Manual Review tab renders three buttons, an interactive map of Virginia, and an area for output information from the map that is blank until user input dictates.

4.3.1 Map Orientation

The map allows users to zoom and pan interactively with the mouse. In the top left corner there are a number of tools to aid users. The plus and minus buttons offer fixed zoom in/out. The home button returns the map to the original zoom level in case a user gets lost. The magnifying glass button allows users to search for and zoom to a selected station. To use this feature, click on the magnifying glass and begin typing the StationID of interest in the text box; after each character entered, a dropdown of matching options appears. Once a station is selected, the map zooms to that station and highlights it with a red circle. Note that if the All Stations in Basin layer is not turned on, no station point appears inside the red circle. Additional layers available in the map (by hovering over the layers button on the left side) include All Stations in Basin (all other stations identified in the basin with data for the given IR window) and Assessment Regions (regional assessment boundaries).

4.3.2 What do the buttons do?

The three buttons correspond to the station snapping summaries presented on the Watershed Selection tab. In the example above, this subbasin has:

  * 8 stations that snapped to 1 WQS segment in preprocessing.
  * 1 station that snapped to > 1 WQS segment in preprocessing.
  * 1 station that snapped to 0 WQS segments in preprocessing.

By clicking the Plot stations that snapped to 1 WQS Segment, users will see all the stations that only retrieved one WQS segment in the spatial snapping process. They are colored from green to light red based on the buffer distance required to snap to a segment (within the smallest set buffer distance of the sequence dictated to the automated snapping functions). Green stations snapped to segments at closer distances compared to red sites. The legend for the buffer distance station colors is presented in the right corner of the app.

By clicking the Plot stations that snapped to >1 WQS Segment, users will see all the stations that retrieved multiple WQS segments in the spatial snapping process (within the smallest set buffer distance of the sequence dictated to the automated snapping functions). They are colored to red to indicate these stations need more attention.

By clicking the Plot stations that snapped to 0 WQS Segment, users will see all the stations that retrieved no WQS segments in the spatial snapping process (within the largest buffer distance of the sequence dictated to the automated snapping functions). They are colored to orange to indicate these stations need the most attention.

If any of the categories listed above had 0 stations, a warning message will appear in the lower right corner of the app indicating there is nothing for the button to plot.

4.3.3 Application Operation

The goal of this section of the application is to review each station and either accept or adjust the provided WQS information snapped to the site. To review a site, users may zoom to a selected map area and click on a site. By clicking a site, users are selecting a site to QA, as indicated by the green halo around the selected site. The associated metadata is now populated in the Stations Data and Spatially Joined WQS tab below the map.

The Selected Station Information table details information about the station including:

  * the WQS_ID of the snapped segment(s)
  * the Buffer Distance (distance required to snap to the linked segment(s))
  * n (the number of segments linked to the station during spatial processing)
  * a hyperlink to the internal GIS Web Application that pulls up the appropriate WQS layer and station information in a new browser tab
  * additional station metadata information

The Spatially Joined WQS Information table details information about the WQS segment(s) attached to the selected station including:

  * WQS_ID (unique code attributed to each WQS segment)
  * GNIS_Name (name of the WQS segment)
  * SEC (WQS section)
  * CLASS (WQS class)
  * SPSTDS (WQS special standards)
  * SECTION_DESCRIPTION (narrative description of the section location)
  * PWS (whether or not the section is designated as a Public Water Supply)
  * Tier_III (whether or not the section is designated as a Tier III water)
  * additional WQS segment information

Users may click on more than one station and subsequently reveal information about more than one station in the Stations Data and Spatially Joined WQS tab below the map. If more than one row is presented in a table, a vertical scroll bar on the right side of the table will automatically appear so that the user may access all the table information.

4.3.3.1 Application Operation: Clear Selection

To deselect a station or stations, click the blue Clear Selection button below the map.

4.3.4 Application Operation:
Station that Snapped to 1 WQS Segment

Should a user select a station for review that snapped to 1 WQS segment, the process for QAing that station is simple. With the station highlighted with the green halo, the user reviews all the information presented below the map in the Stations Data and Spatially Joined WQS tab. Additionally, when the Plot stations that snapped to 1 WQS Segment button was pressed, additional spatial layers became available in the layers drop down of the map. The user may now turn on the layer called WQS Segments of Stations Snapped to 1 Segment to reveal all the segments in that category. These segments may be clicked on in the map to reveal the same information from the Selected Station Information table in a popup on the map.

4.3.4.1 Application Operation: Accept

If the user is satisfied with the snapped WQS segment, they may press the blue Accept button to reveal a modal window named Accept Snapped WQS. This modal shows the information from the Selected Station Information table one last time for review. There is a textbox below the table where users may input any additional comments and documentation to archive with the snapped StationID and WQS segment information. This could be why the selection was made, notes to future assessors about this site/segment, or nothing at all. At this point the user may navigate away from the modal by pressing Dismiss or Cancel if they choose to not proceed with that station.

Once the user presses the Accept button in the Accept Snapped WQS modal, the station and WQS segment are linked, the modal disappears, the station changes to a purple (completed) color, and the count of stations “to do” in that category at the top of the page goes down by one.

4.3.4.2 Application Operation: Manual WQS Adjustment (1 segment)

If the user is not satisfied with the snapped WQS segment, they may press the blue Manual WQS Adjustment button to reveal a modal window named Manually Adjust WQS. This modal shows the information from the Selected Station Information table one last time for review. There are two textboxes below the table where users may input the desired WQS_ID to attach to the StationID and any additional comments and documentation to archive with the snapped StationID and WQS segment information. This could be why the selection was made, notes to future assessors about this site/segment, or nothing at all. At this point the user may navigate away from the modal by pressing Dismiss or Cancel if they choose to not proceed with that station.

It is highly recommended that users copy and paste the desired WQS_ID from either this application or the GIS Web App to ensure the correct WQS_ID information is entered into the text box.

Pro Tip: Clicking the WQS_ID cell of the table in the modal window three times quickly highlights the cell contents. Users may then use keyboard shortcuts to copy (Ctrl+C) and paste (Ctrl+V) this information into the text box below the table.

Once the user presses the Accept button in the Manually Adjust WQS modal, the station and manually entered WQS segment are linked, the modal disappears, the station changes to a purple (completed) color, and the count of stations “to do” in that category at the top of the page goes down by one.

4.3.4.3 Application Operation: Export Reviews

All the work the user has completed so far, whether accepting or manually adjusting WQS, is saved only in the local application environment. The only way to preserve the work completed in the application is to press the green Export reviews button. It is highly recommended that users click the Export reviews button frequently (every 3-4 stations or after a particularly challenging review) to ensure their work is in fact saved on the server.

Users may “undo” any accidental work still in their local application environment by either closing the app or returning to the Watershed Selection tab without pressing the Export reviews button. This workflow would look like:

  1. User selects station, Accept
  2. User selects station, Accept
  3. User selects station, Accept
  4. User presses Export reviews to save work to the R server
  5. User selects station, Accept; the user realizes they made a mistake and does not want to save that accept to the server
  6. User closes the app or navigates back to the Watershed Selection tab without pressing the Export reviews button; all information completed in the app since the last time Export reviews was pressed will not be saved
  7. User opens the app or clicks the Begin Review With Subbasin Selection (Retrieves Last Saved Result) button on the Watershed Selection tab to pull the last information saved on the server
  8. User proceeds to the Manual Review tab and continues work
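
The save-on-export behavior underlying this workflow can be sketched in a few lines. This is an illustrative Python sketch of the pattern only, not the application's actual (R/Shiny) code; the file path, class, and field names are hypothetical.

```python
import json
from pathlib import Path

class ReviewSession:
    """Illustrative sketch of the save-on-export pattern: accepted reviews
    accumulate in local memory and reach the saved file only when
    export() is called (the Export reviews button)."""

    def __init__(self, save_path):
        self.save_path = Path(save_path)
        # Opening a session pulls the last exported state, mimicking the
        # Begin Review With Subbasin Selection (Retrieves Last Saved Result) button.
        if self.save_path.exists():
            self.reviews = json.loads(self.save_path.read_text())
        else:
            self.reviews = []
        self.unsaved = []  # local-only work, lost if the app closes first

    def accept(self, station_id, wqs_id, comment=""):
        # Pressing Accept only updates the local application environment.
        self.unsaved.append(
            {"StationID": station_id, "WQS_ID": wqs_id, "Comment": comment}
        )

    def export(self):
        # Pressing Export reviews persists everything accepted so far.
        self.reviews.extend(self.unsaved)
        self.unsaved = []
        self.save_path.write_text(json.dumps(self.reviews))
```

Closing the session without calling export() discards anything still in `unsaved`, which is exactly the “undo” behavior described in steps 5-8 above.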

4.3.5 Application Operation: Station that Snapped to >1 WQS Segment

Should a user select a station for review that snapped to >1 WQS segment, the process for QAing that station is relatively simple. With the station highlighted with the green halo, the user reviews all the information presented below the map in the Stations Data and Spatially Joined WQS tab. Additionally, when the Plot stations that snapped to >1 WQS Segment button was pressed, additional spatial layers became available in the layers drop-down of the map. The user may now turn on the layer called WQS Segments of Stations Snapped to >1 Segment to reveal all the segments in that category. These segments are plotted on a color ramp to distinguish them from one another and may be clicked on in the map to reveal the same information from the Selected Station Information table in a popup on the map. The GIS Web App hyperlinks may also be useful at this stage.

4.3.5.1 Application Operation: Manual WQS Adjustment (> 1 segment)

If the user needs to choose between multiple WQS segments, they must press the blue Manual WQS Adjustment button to reveal a modal window named Manually Adjust WQS. The user cannot use the Accept button to accept a single StationID that is linked to multiple WQS_IDs. The Manually Adjust WQS modal shows the information from the Selected Station Information table one last time for review. There are two textboxes below the table where users may input the desired WQS_ID to attach to the StationID and any additional comments and documentation to archive with the snapped StationID and WQS segment information: why the selection was made, notes to future assessors about this site/segment, or nothing at all. At this point the user may navigate away from the modal by pressing Dismiss or Cancel if they choose not to proceed with that station.

It is highly recommended that users copy and paste the desired WQS_ID from either this application or the GIS Web App to ensure the correct WQS_ID information is entered into the text box.

Pro Tip: Clicking the record in the WQS_ID cell of the table on the modal window three times in quick succession highlights the cell contents. Users may then use the keyboard shortcuts copy (Ctrl+C) and paste (Ctrl+V) to transfer this information into the text box below the table.

Once the user presses the Accept button in the Manually Adjust WQS modal, the station and manually entered WQS segment are linked, the modal disappears, the station changes to a purple (completed) color, and the count of stations “to do” in that category at the top of the page goes down by one.

4.3.6 Application Operation: Station that Snapped to 0 WQS Segments

Should a user select a station for review that snapped to no WQS segments, the process for QAing that station is more involved. With the station highlighted with the green halo, the user reviews all the information presented below the map in the Stations Data and Spatially Joined WQS tab. To preserve application rendering time, additional spatial layers are not available in the layers drop-down of the map for this scenario. Instead, the user should click the Open Link In New Tab button in the Selected Station Information table to navigate to the GIS Web App to visualize all WQS segments around the station.

4.3.6.1 Application Operation: Manual WQS Adjustment (0 segments)

Once the user has chosen an appropriate WQS segment, they must press the blue Manual WQS Adjustment button to reveal a modal window named Manually Adjust WQS. The user cannot use the Accept button to accept a single StationID that is linked to no WQS_ID. The Manually Adjust WQS modal shows the information from the Selected Station Information table one last time for review; it will show information about the station but no WQS_ID. There are two textboxes below the table where users may input the desired WQS_ID to attach to the StationID and any additional comments and documentation to archive with the snapped StationID and WQS segment information: why the selection was made, notes to future assessors about this site/segment, or nothing at all. At this point the user may navigate away from the modal by pressing Dismiss or Cancel if they choose not to proceed with that station.

It is highly recommended that users copy and paste the desired WQS_ID from either this application or the GIS Web App to ensure the correct WQS_ID information is entered into the text box.

Once the user presses the Accept button in the Manually Adjust WQS modal, the station and manually entered WQS segment are linked, the modal disappears, the station changes to a purple (completed) color, and the count of stations “to do” in that category at the top of the page goes down by one.

4.3.7 Application Operation: There are no more stations to attribute

Congratulations! That means you have completed the review of all the stations in this subbasin. Make sure you press the Export reviews button one more time to ensure all your hard work is saved on the server.

At this point, the user may navigate back to the Watershed Selection tab to choose another subbasin to review or close the application to end their metadata review session.

4.4 Having problems?

Please contact Emma Jones () should you encounter any problems using the application, wonky situations to review, or long rendering times/app crashes.

5 Riverine Application How To

The Riverine Assessment App is located on the R server. update link to 2024 when pushed

6 Lacustrine Application How To

The Lakes Assessment App is located on the R server. update link to 2024 when pushed

7 Bioassessment Applications How To

There are two bioassessment tools designed to assist biological staff with the assessment process. They include the Bioassessment Dashboard and the Benthic Assessment Fact Sheet Tool. Both tools are helpful throughout the bioassessment process, providing standardized data queries, visualizations, and analyses for biologists. The CEDS Benthic Data Query Tool is a supplementary resource for more in-depth benthic analyses, but that tool will not be discussed in depth in this chapter.

7.1 General Bioassessment Workflow

After benthic macroinvertebrate data collection, processing, identification, and CEDS data entry, regional biological staff should assess all samples collected within a given Integrated Report window and provide station-level summaries to regional assessment staff. Biologists are best suited to make biological assessment decisions due to their expertise in benthic community structure and because they visited the sites and thus have personal knowledge of the site, watershed, and sampling conditions that might affect how data should be interpreted.

Historically, biologists provided bioassessment decisions to regional assessment staff through individual bioassessment fact sheets. These one-page documents summarize all benthic and habitat sampling data from a given site and are excellent resources for monitoring, assessment, and TMDL staff well after the fact sheet was written. Regional assessment staff rely on these fact sheets to translate expert interpretation of biological data directly into assessment decisions. These fact sheets improve communication between the monitoring and assessment arms of the programs and can help guide water planning efforts. However, these fact sheets were time-consuming to create and difficult to keep track of during and after a given assessment cycle.

With the adoption of more automated methods throughout the assessment process, regional biologists received two automated tools to standardize and expedite the bioassessment process. The Bioassessment Dashboard is intended as a first stop for biologists as they begin bioassessment in a given cycle. The tool identifies all biological and habitat data collected within a given assessment cycle, along with station-specific bioassessment information from previous assessment cycles, to readily assist with biological assessment decisions. This interactive tool performs all necessary data querying and manipulation, and provides interactive mapping, plotting, and tabular data summaries.

When a station-level biological decision has been made, the regional biologists enter that decision and other metadata into a template (download a copy of the template here). All fields must be completed in the template. When one or more bioassessment decisions are completed in the template, biologists may submit these decisions to be archived on the R server using the Benthic Assessment Fact Sheet Tool. This tool allows biologists to append their bioassessment decisions to a pinned dataset on the R server (with a point and click interface) and, once bioassessment decisions are pinned, to generate and download a standardized biological fact sheet with the click of a button. There is no need to send these fact sheets to assessment staff; they have access to these fact sheets inside the Riverine Assessment Application as soon as the bioassessment data are pinned (archived) on the R server.

7.2 Application Use: Bioassessment Dashboard

7.2.1 Map Tab

On load, the main panel of the Bioassessment Dashboard displays all biological sampling locations from the current IR cycle on the Map tab. The map allows users to zoom and pan interactively with the mouse. In the top left corner there are a number of tools to aid users. The plus and minus buttons offer fixed zoom in/out. The home button returns the map orientation to the original zoom level in case a user gets lost. Additional layers available in the map (by hovering over the layers button on the left side) include US EPA Level 3 Ecoregions and Assessment Regions (regional assessment boundaries).

A table below the map displays station metadata information for all stations currently displayed in the map. All tables in the dashboard are interactive and columns may be sorted or turned on/off (using the Column Visibility button in the upper left corner) as well as copied to the user’s clipboard.

Before proceeding to subsequent tabs, it is beneficial to refine the working dataset using the Basin, Collector, StationID, or Replicate filters available in the left side panel.

Pro Tip: These filters are cascading hierarchical filters. Starting from the top down, if a user chooses one or more Basins, then only the Collectors with data within those Basins are provided in the Collector Filter box. The same logic applies to the StationID Filter box. Not every box must be filled in to utilize the hierarchical filtering.
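
The cascading behavior described in the Pro Tip can be sketched as follows. This is an illustrative Python sketch of the filter logic only (the dashboard itself is built in R/Shiny); the function name and example records are hypothetical, while the Basin/Collector/StationID fields mirror the dashboard's filter boxes.

```python
def cascade_options(records, basins=None, collectors=None):
    """Return the Collector and StationID choices consistent with the
    filters selected so far; an empty selection means 'no filter',
    matching the dashboard's optional top-down hierarchy."""
    # Narrow to the chosen Basins first (top of the hierarchy).
    subset = [r for r in records if not basins or r["Basin"] in basins]
    collector_choices = sorted({r["Collector"] for r in subset})
    # Then narrow to the chosen Collectors to populate StationID choices.
    subset = [r for r in subset if not collectors or r["Collector"] in collectors]
    station_choices = sorted({r["StationID"] for r in subset})
    return collector_choices, station_choices
```

Picking a Basin immediately shrinks the Collector choices, and picking a Collector shrinks the StationID choices, but any level may be skipped.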

Pro Tip: Once the user has set the query conditions to their liking, the hamburger button in the blue navigation bar will hide/show the sidebar. Hiding the sidebar can be helpful to maximize data visualization features in the browser window space.

7.2.2 SCI Scores Tab

Based on the query terms, the Stream Condition Index (SCI) scores will appear in an interactive plot. The plot is composed of layers of replicate number by season by sampling method (e.g., Rep 1 Fall (Boatable) or Rep 2 Spring (Riffle)). Additionally, both VSCI and VCPMI criteria are plotted as layers. All layers may be turned on/off by clicking on them in the legend on the left-hand side of the plot. Helpful interactive features of the plot include hovering over samples to reveal a popup box with additional sampling information, downloading the plot as a PNG, zooming to areas of the plot with mouse selection, autoscaling, resetting the axes, and more. Hover your mouse over the top right of the plot to reveal these features.

A table below the plot displays benthic sample information for all stations currently displayed in the plot. Sample information, SCI scores, and SCI metrics can be found for each sample in this table. All tables in the dashboard are interactive and columns may be sorted or turned on/off (using the Column Visibility button in the upper left corner) as well as copied to the user’s clipboard.

7.2.2.1 SCI Method Decision

SCI scores are provided for each station based on the station location to expedite application rendering time. Basin and Level 3 Ecoregion information spatially derived from the sample latitude and longitude provide the SCI method determination (VSCI vs. VCPMI). Should you wish to investigate other SCI methods for a given station, the CEDS Benthic Data Query Tool provides a higher level of analytical control.

7.2.3 Habitat Scores Tab

Based on the query terms, the Total Habitat Scores will appear in an interactive plot. The plot is composed of layers of season by sampling method (e.g., Fall (High Gradient Method) or Spring (Low Gradient Method)). Additionally, the Benthic Stressor Analysis thresholds indicating probability of stress to aquatic life are plotted as layers. All layers may be turned on/off by clicking on them in the legend on the left-hand side of the plot. Helpful interactive features of the plot include hovering over samples to reveal a popup box with additional sampling information, downloading the plot as a PNG, zooming to areas of the plot with mouse selection, autoscaling, resetting the axes, and more. Hover your mouse over the top right of the plot to reveal these features.

A table below the plot displays habitat sample information for all stations currently displayed in the plot. Sample information, individual habitat metric scores, and Total Habitat Scores can be found for each sample in this table. The cells are color coded in a red scheme where darker colors highlight lower scores for a given habitat metric. If a metric score is <= 10, it is bolded. All tables in the dashboard are interactive and columns may be sorted or turned on/off (using the Column Visibility button in the upper left corner) as well as copied to the user’s clipboard.
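
The conditional formatting rule for these cells can be sketched as a simple threshold function. This Python sketch is illustrative only: the shade breakpoints and color names are assumptions (only the “lower score = darker red” direction and the <= 10 bolding rule come from the text), assuming the usual 0-20 habitat metric scale.

```python
def habitat_cell_style(score):
    """Map a habitat metric score (assumed 0-20 scale) to an illustrative
    cell style: darker red shades for lower scores, bold when <= 10.
    Breakpoints and color names are hypothetical examples."""
    if score <= 5:
        shade = "darkred"
    elif score <= 10:
        shade = "red"
    elif score <= 15:
        shade = "lightcoral"
    else:
        shade = "mistyrose"
    return {"background": shade, "bold": score <= 10}
```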

7.2.4 Station Summary Tab

Based on the query terms, the SCI Summary Table will display summaries of the SCI averages across various windows. Where data are available, these windows include the entire IR window (6 year average), the average of the most recent two years of the IR period, the spring sample average, the fall sample average, and individual year averages. The number of samples in each window is provided in the n Samples field. All tables in the dashboard are interactive and columns may be sorted or turned on/off (using the Column Visibility button in the upper left corner) as well as copied to the user’s clipboard.

The Total Habitat Summary Table will display summaries of the Total Habitat Score averages across the same windows. Where data are available, these windows include the entire IR window (6 year average), the average of the most recent two years of the IR period, the spring sample average, the fall sample average, and individual year averages. The number of samples in each window is provided in the n Samples field. All tables in the dashboard are interactive and columns may be sorted or turned on/off (using the Column Visibility button in the upper left corner) as well as copied to the user’s clipboard.
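
The summary windows used by both tables can be sketched as a single averaging routine. This Python sketch is illustrative (the application is R-based); the sample structure and field names ('year', 'season', 'score') are assumptions, and it assumes at least one sample falls within the IR window.

```python
from statistics import mean

def window_summaries(samples, ir_years):
    """Sketch of the summary-window averages described above. Each sample
    is assumed to be a dict with 'year', 'season', and 'score' keys
    (hypothetical field names); ir_years is the 6-year IR window,
    e.g. range(2017, 2023)."""
    in_window = [s for s in samples if s["year"] in ir_years]
    recent = [s for s in in_window if s["year"] in list(ir_years)[-2:]]
    out = {
        "IR window avg": mean(s["score"] for s in in_window),
        "n Samples": len(in_window),
    }
    if recent:
        out["Recent 2-yr avg"] = mean(s["score"] for s in recent)
    # Seasonal averages, reported only where data are available.
    for season in ("Spring", "Fall"):
        seasonal = [s for s in in_window if s["season"] == season]
        if seasonal:
            out[f"{season} avg"] = mean(s["score"] for s in seasonal)
    # Individual year averages.
    for yr in sorted({s["year"] for s in in_window}):
        out[str(yr)] = mean(s["score"] for s in in_window if s["year"] == yr)
    return out
```

The same routine serves both tables; only the score column differs (SCI score vs. Total Habitat Score).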

The Previous Integrated Report Cycle Bioassessment Information Table will display bioassessment decision information from previous IR windows, where this information is available in the standardized archived format. This information can be helpful to review when making biological assessments to understand any historical decisions.

7.2.5 Assessment Decision Tab

Based on the query terms, the Assessment Decision Summary Table displays bioassessment information for the current IR cycle. If you have uploaded information about the chosen station(s) to the VDEQ Benthic Assessment Fact Sheet Tool, you will see the information that is saved on the server in the table below. If you upload new data about the selected station(s) to the VDEQ Benthic Assessment Fact Sheet Tool, that information will be reflected here after you refresh the dashboard.

7.3 Application Use: Benthic Assessment Fact Sheet Tool

7.3.1 Assessment Fact Sheet Tab

7.3.2 General Purpose Biological Fact Sheet Tab